<a name="top" id="top"></a>

<div align="center">
    <h1>CHE597 - Iterative Development</h1>
    <a href="https://github.com/bernalde">David E. Bernal Neira</a>
    <br>
    <i>Davidson School of Chemical Engineering, Purdue University</i>
    <br>
    <a href="https://colab.research.google.com/github/SECQUOIA/PU_CHE597_S2025/blob/main/5b-Iterative_Code_Development/iterative_development.ipynb" target="_parent">
        <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
    </a>
    <a href="https://secquoia.github.io/">
        <img src="https://img.shields.io/badge/🌲⚛️🌐-SECQUOIA-blue" alt="SECQUOIA"/>
    </a>
</div>


In [None]:
# Detect whether this notebook is running on Google Colab
try:
  import google.colab
  IN_COLAB = True
except ImportError:
  IN_COLAB = False


<b>If you are using Google Colab, save this notebook and any associated text files to their own folder on your Google Drive. You will then need to adapt the following commands so that the notebook runs from the location of that folder.</b>


In [None]:
# If you want to use Google Drive to save/load files, set this to True
USE_GOOGLE_DRIVE = False
if IN_COLAB and USE_GOOGLE_DRIVE:
    from google.colab import drive
    drive.mount('/content/drive')

    # Colab command to navigate to the folder holding the homework,
    # CHANGE FOR YOUR SPECIFIC FOLDER LOCATION IN GOOGLE DRIVE
    # Note: if there are spaces in the path, you need to precede them with a backslash '\'
    %cd /content/drive/My\ Drive/CHE597/In-Class_Examples/HW2

# Iterative Development

When you are writing complex programs, or even just multi-step functions or loops, it is critical that you break things into digestible chunks. The term <b>iterative development</b> describes this need to iterate on a piece of code until it does everything you want.

On HW 2, I asked you to write two functions that required several steps each: writing a window average function, and writing a parser for a file with chained data for multiple reactors. In each of these problems you had to write functions that did several things at once. But starting from scratch, it is very difficult to write something that does everything on the first try. It is much easier to write pieces of the functions, test as you go, and incrementally build up the coding solution. In this notebook, I will show you an example of how iterative development works. 

Here is the `mult_reactors.txt` parsing problem from the homework:

In [2]:
# The file mult_reactors.txt has time vs. impurity data for several reactors.
# Inspect the file.
# Write your own parser that reads each reactor's data into a separate array
# and assigns these arrays to a dictionary named reactors.
# NOTE: np.genfromtxt() and np.loadtxt() will not work for this task
import numpy as np

This problem asks us to 1) open a file for reading and iteration, 2) distinguish between different reactors, and 3) save the data in a specific format (arrays within a dictionary).

Taking an iterative approach, we will break this down into elementary tasks. Specifically, we will start with reading data, then try to parse just the first reactor's data, before finally parsing all of the reactors at once. Eventually we will arrive at a relatively sophisticated solution, but each of the individual steps is quite straightforward.

Let's first write a solution for opening the file and iterating over its contents:

In [None]:
with open("mult_reactors.txt",'r') as f:
  for count, lines in enumerate(f):
    print(lines)

    # For diagnostic purposes
    if count == 10:
      break    

The code above is responsible for opening the file and reading it line by line. We've also used the `count/enumerate` construction to test it on the first few lines. We won't keep this around in the final solution, but putting in this kind of diagnostic scaffolding is important while we're developing our solution. Now, let's add code so that we only get the parts of the file that are data (as opposed to all of the strings and headers):

In [None]:
with open("mult_reactors.txt",'r') as f:
  for count, lines in enumerate(f):
    fields = lines.split() # the file is space delimited so this breaks up the lines into words

    # Get the numbers
    try:
      print("{} {}".format(float(fields[0]),float(fields[1]))) # only numbers will survive this try
    except (ValueError, IndexError): # non-numeric or too-short lines are skipped
      pass

    # For diagnostic purposes
    if count == 10:
      break 

So far so good, we are (1) reading the file and (2) only grabbing the data. Now, we want to save the data, not just print it. Since we are reading in line by line, we will need to use an object that is good for appending values, like a list:

In [None]:
data = []
with open("mult_reactors.txt",'r') as f:
  for count, lines in enumerate(f):
    fields = lines.split() # the file is space delimited so this breaks up the lines into words

    # Get the numbers
    try:
      data += [float(fields[0]),float(fields[1])]
    except (ValueError, IndexError): # non-numeric or too-short lines are skipped
      pass

    # For diagnostic purposes
    if count == 10:
      break 
print(data)

Now we are saving the data, but we can see a problem: it isn't keeping the rows separate in the list, it is just combining all of the rows. To keep the rows separate, it is better to save the data as a list of lists, or a list of tuples:

In [None]:
data = []
with open("mult_reactors.txt",'r') as f:
  for count, lines in enumerate(f):
    fields = lines.split() # the file is space delimited so this breaks up the lines into words

    # Get the numbers
    try:
      data += [[float(fields[0]),float(fields[1])]]
    except (ValueError, IndexError): # non-numeric or too-short lines are skipped
      pass

    # For diagnostic purposes
    if count == 10:
      break 
print(data)

Now, the data from row `0` is in the list at index `0`, and so on. 
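The difference between the two versions comes down to how `+=` extends a list. The short sketch below isolates that behavior; the `row` values are made up for illustration and are not taken from the data file.

```python
row = [0.0, 1.5]  # hypothetical (time, impurity) pair

flat = []
flat += row         # extends the list with the two numbers themselves
flat += row

nested = []
nested += [row]     # extends with a one-element list, so the row stays intact
nested.append(row)  # append(row) has the same effect as += [row]

print(flat)    # [0.0, 1.5, 0.0, 1.5]
print(nested)  # [[0.0, 1.5], [0.0, 1.5]]
```

With the nested form, `nested[i]` returns the whole row `i`, which is what we need in order to convert the list into a two-column array later.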

Next, we know that there are several independent reactors, each separated by a header. Let's add the code necessary to <b>only get the data from the first reactor</b>. This is a good iterative strategy. After we get the first reactor working, we can deal with the other reactors.

To only get the first reactor's data we will add a "counter" or "iteration" variable `N_reactor`, to keep track of which reactor we are parsing, and we will need to add an `if` statement of some kind to figure out when the reactor's data ends. We will also update our `break` statement to exit the loop after we think that we have the first reactor's data:

In [None]:
data = [] # will temporarily hold each reactor's data
N_reactor = -1 # keeps track of which reactor we are parsing
with open("mult_reactors.txt",'r') as f:
  for lines in f:
    fields = lines.split() # the file is space delimited so this breaks up the lines into words

    # Identify start of new reactor data 
    if fields[0] == "Time(min)":
      N_reactor += 1

    # Get the numbers
    try:
      data += [[float(fields[0]),float(fields[1])]]
    except (ValueError, IndexError): # non-numeric or too-short lines are skipped
      pass

    # For diagnostic purposes
    if N_reactor == 1:
      break
print(data)

We've added a way of finding the end of the reactor's data by using the header entry `"Time(min)"`. Every time that the parser encounters such a line, a new reactor is about to be parsed. We've also updated our diagnostic `break` statement to break after the first reactor has been parsed (i.e., when `N_reactor == 1`) and we've removed the `enumerate()` function because it isn't needed any longer.

However, we've encountered a problem: `IndexError: list index out of range`. This means that we've tried to access an element `fields[0]` which is out of range for the list. Since `0` is the first element, this means that during this iteration the list was empty. Let's add a `print` statement to see what was happening right before the failure:

In [None]:
data = [] # will temporarily hold each reactor's data
N_reactor = -1 # keeps track of which reactor we are parsing
with open("mult_reactors.txt",'r') as f:
  for lines in f:
    fields = lines.split() # the file is space delimited so this breaks up the lines into words

    # FIND OUT WHAT WAS HAPPENING BEFORE THE ERROR
    print(fields)

    # Identify start of new reactor data
    if fields[0] == "Time(min)":
      N_reactor += 1

    # Get the numbers
    try:
      data += [[float(fields[0]),float(fields[1])]]
    except (ValueError, IndexError): # non-numeric or too-short lines are skipped
      pass

    # For diagnostic purposes
    if N_reactor == 1:
      break
#print(data)

The printout is long. This is because the parser worked for a long time before encountering an error. Looking at the last printed statement before the error, we see an empty list (`[]`). Inspecting the file, we would see that there is an empty line between the reactors. Now that we see it, we can easily deal with this by skipping empty lines. Let's try again:

In [None]:
data = [] # will temporarily hold each reactor's data
N_reactor = -1 # keeps track of which reactor we are parsing
with open("mult_reactors.txt",'r') as f:
  for lines in f:
    fields = lines.split() # the file is space delimited so this breaks up the lines into words

    # Skip empty
    if len(fields) == 0:
      continue

    # Identify start of new reactor data
    if fields[0] == "Time(min)":
      N_reactor += 1

    # Get the numbers
    try:
      data += [[float(fields[0]),float(fields[1])]]
    except (ValueError, IndexError): # non-numeric or too-short lines are skipped
      pass

    # For diagnostic purposes
    if N_reactor == 1:
      break
print("length of data: {}".format(len(data)))
print("data[0]: {}".format(data[0]))
print("data[-1]: {}".format(data[-1]))

Skipping empty lines has now solved our problem. The code executes without error through the first reactor, and we have successfully saved all of its data to `data`.
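To see why the guard works, it may help to look at what `split()` returns for a blank line. The sketch below uses a few made-up lines (the `Impurity` column name is an assumption, not copied from the file) and applies the same `len(fields) == 0` check before touching `fields[0]`.

```python
# Hypothetical lines, in the form a file iterator would yield them
sample_lines = ["Time(min) Impurity\n", "0.0 1.5\n", "\n", "1.0 1.2\n"]

print("\n".split())  # [] : a blank line has no fields at all

first_fields = []
for line in sample_lines:
    fields = line.split()
    if len(fields) == 0:
        continue                    # skip blank lines before indexing
    first_fields.append(fields[0])  # safe: fields has at least one entry

print(first_fields)  # ['Time(min)', '0.0', '1.0']
```

Placing the check first means every later statement in the loop body can assume that `fields[0]` exists.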

Before parsing the next reactor, we need to put the first reactor's data somewhere. Specifically, from the problem we know that we should save it as an array within a dictionary. Let's add this functionality before trying to parse the rest of the reactors.

In [None]:
data = [] # will temporarily hold each reactor's data
N_reactor = -1 # keeps track of which reactor we are parsing
reactors = {} # dictionary for holding each reactor's data as an array
with open("mult_reactors.txt",'r') as f:
  for lines in f:
    fields = lines.split() # the file is space delimited so this breaks up the lines into words

    # Skip empty
    if len(fields) == 0:
      continue

    # Identify start of new reactor data
    if fields[0] == "Time(min)":
      reactors[N_reactor] = np.array(data)
      N_reactor += 1

    # Get the numbers
    try:
      data += [[float(fields[0]),float(fields[1])]]
    except (ValueError, IndexError): # non-numeric or too-short lines are skipped
      pass

    # For diagnostic purposes
    if N_reactor == 1:
      break
print("reactors.keys(): {}".format(reactors.keys()))
print("lengths: {}".format([ len(reactors[i]) for i in reactors.keys() ]))


To check our work we might have printed out the whole `reactors` dictionary and we would have seen an issue. To save space I've just printed out the `keys()` and lengths of each array in `reactors`. The issue is that we have two arrays saved, with keys `-1` and `0` respectively, and the first one is empty. What has happened? Take a minute and think about it before moving on. 

The issue is that the first time that we encounter `"Time(min)"` is at the top of the file, <i>before</i> we have parsed any reactor data. This first encounter with `"Time(min)"` needs to be skipped. An easy way to do this is to only save the `data` if it is non-empty: 

In [None]:
data = [] # will temporarily hold each reactor's data
N_reactor = -1 # keeps track of which reactor we are parsing
reactors = {} # dictionary for holding each reactor's data as an array
with open("mult_reactors.txt",'r') as f:
  for lines in f:
    fields = lines.split() # the file is space delimited so this breaks up the lines into words

    # Skip empty
    if len(fields) == 0:
      continue

    # Identify start of new reactor data
    if fields[0] == "Time(min)":

      # Save if non-empty
      if data:
        reactors[N_reactor] = np.array(data)
      N_reactor += 1

    # Get the numbers
    try:
      data += [[float(fields[0]),float(fields[1])]]
    except (ValueError, IndexError): # non-numeric or too-short lines are skipped
      pass

    # For diagnostic purposes
    if N_reactor == 1:
      break
print("reactors.keys(): {}".format(reactors.keys()))
print("lengths: {}".format([ len(reactors[i]) for i in reactors.keys() ]))

With this change, we are now successfully parsing the first reactor's data and assigning it to the dictionary as an array. 
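The `if data:` check relies on the fact that an empty list evaluates to `False` in a condition. Here is a minimal sketch of that behavior in isolation; the `buffers` dictionary and its values are hypothetical stand-ins for the state of `data` at the first and second header.

```python
# Hypothetical buffer contents at the first header (empty) and second header
buffers = {-1: [], 0: [[0.0, 1.5], [1.0, 1.2]]}

saved = {}
for key, data in buffers.items():
    if data:               # False for [], True for any non-empty list
        saved[key] = data

print(list(saved.keys()))  # [0]
```

Note that this shortcut is specific to lists and other built-in containers. A NumPy array with more than one element raises a `ValueError` when used directly in an `if`, so the check has to happen before the conversion with `np.array`.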

We've done almost all of the work. The code as written should work on the rest of the reactors, with two modifications. But first let's just test it to see what it does when we let it loose on the rest of the file:

In [None]:
data = [] # will temporarily hold each reactor's data
N_reactor = -1 # keeps track of which reactor we are parsing
reactors = {} # dictionary for holding each reactor's data as an array
with open("mult_reactors.txt",'r') as f:
  for lines in f:
    fields = lines.split() # the file is space delimited so this breaks up the lines into words

    # Skip empty
    if len(fields) == 0:
      continue

    # Identify start of new reactor data
    if fields[0] == "Time(min)":

      # Save if non-empty
      if data:
        reactors[N_reactor] = np.array(data)
      N_reactor += 1

    # Get the numbers
    try:
      data += [[float(fields[0]),float(fields[1])]]
    except (ValueError, IndexError): # non-numeric or too-short lines are skipped
      pass

    #
    # DELETED THE DIAGNOSTIC BREAK STATEMENT
    #
print("reactors.keys(): {}".format(reactors.keys()))
print("lengths: {}".format([ len(reactors[i]) for i in reactors.keys() ]))

There are no formal errors, but we can see two problems with the output. First, we see that the amount of data for each reactor is growing, but if we inspect the file we see that it should be constant. Can you see why this is happening?

The problem is that we aren't resetting the `data` list between reactors, so the previous reactor's data is still there when we start parsing the next. We can correct this by reinitializing the `data` list when we save the reactor data to the dictionary:

In [None]:
data = [] # will temporarily hold each reactor's data
N_reactor = -1 # keeps track of which reactor we are parsing
reactors = {} # dictionary for holding each reactor's data as an array
with open("mult_reactors.txt",'r') as f:
  for lines in f:
    fields = lines.split() # the file is space delimited so this breaks up the lines into words

    # Skip empty
    if len(fields) == 0:
      continue

    # Identify start of new reactor data
    if fields[0] == "Time(min)":

      # Save if non-empty
      if data:
        reactors[N_reactor] = np.array(data)
        data = [] # reinitialize data after we save it
      N_reactor += 1

    # Get the numbers
    try:
      data += [[float(fields[0]),float(fields[1])]]
    except (ValueError, IndexError): # non-numeric or too-short lines are skipped
      pass

print("reactors.keys(): {}".format(reactors.keys()))
print("lengths: {}".format([ len(reactors[i]) for i in reactors.keys() ]))

Reinitializing `data` after saving each reactor corrects our problem. The second problem is with the number of reactors. If we inspect the file we see that there are 10 reactors, but our final dictionary only has data for 9. What's the problem?

The last reactor never gets saved because it isn't followed by a `"Time(min)"` header. This is a common issue when parsing chained data: the final block won't get saved because the file just ends. We can correct this by saving `data` to the dictionary after the loop finishes:

In [None]:
data = [] # will temporarily hold each reactor's data
N_reactor = -1 # keeps track of which reactor we are parsing
reactors = {} # dictionary for holding each reactor's data as an array
with open("mult_reactors.txt",'r') as f:
  for lines in f:
    fields = lines.split() # the file is space delimited so this breaks up the lines into words

    # Skip empty
    if len(fields) == 0:
      continue

    # Identify start of new reactor data
    if fields[0] == "Time(min)":

      # Save if non-empty
      if data:
        reactors[N_reactor] = np.array(data)
        data = [] # reinitialize data after we save it
      N_reactor += 1

    # Get the numbers
    try:
      data += [[float(fields[0]),float(fields[1])]]
    except (ValueError, IndexError): # non-numeric or too-short lines are skipped
      pass

# Save the last reactor (converted to an array, like the others)
reactors[N_reactor] = np.array(data)

print("reactors.keys(): {}".format(reactors.keys()))
print("lengths: {}".format([ len(reactors[i]) for i in reactors.keys() ]))
print("reactors[9][-1]: {}".format(reactors[9][-1]))

We now have a working solution for our problem. 
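One possible next step is to wrap the same logic in a function so that it can be tested on a small in-memory sample before it is run on the real file. The sketch below is only an illustration: the name `parse_reactors`, the `Impurity` column label, and the sample values are all assumptions, and the real file's layout may differ.

```python
import io
import numpy as np

def parse_reactors(lines):
    """Hypothetical helper: group two-column rows by "Time(min)" headers."""
    reactors = {}   # final storage: reactor index -> array
    data = []       # temporary buffer for the reactor being parsed
    n_reactor = -1  # index of the reactor being parsed
    for line in lines:
        fields = line.split()
        if not fields:                 # skip empty lines
            continue
        if fields[0] == "Time(min)":   # header marks the start of a reactor
            if data:                   # nothing to save at the first header
                reactors[n_reactor] = np.array(data)
                data = []
            n_reactor += 1
        try:
            data.append([float(fields[0]), float(fields[1])])
        except (ValueError, IndexError):
            pass                       # header or other non-numeric line
    if data:                           # the last reactor has no trailing header
        reactors[n_reactor] = np.array(data)
    return reactors

# Made-up sample with two reactors (not the contents of mult_reactors.txt)
sample = io.StringIO(
    "Time(min) Impurity\n"
    "0.0 1.5\n"
    "1.0 1.2\n"
    "\n"
    "Time(min) Impurity\n"
    "0.0 2.5\n"
    "1.0 2.1\n"
)
reactors = parse_reactors(sample)
print(sorted(reactors))   # [0, 1]
print(reactors[1].shape)  # (2, 2)
```

Because the function accepts any iterable of lines, the same call should also work on an open file handle, for example `with open("mult_reactors.txt") as f: reactors = parse_reactors(f)`.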

Review the final program and think about explaining it to someone just starting on the problem. If you try to explain it in the order it is written, there are several things here that are difficult to explain. 

For example, we initialize `data=[]` and `reactors={}` at the beginning. It isn't obvious that we will need both, the first for temporarily holding each reactor's data, and the second for storing the final data. But during iterative development we converged on this solution. Likewise, it isn't obvious that `if len(fields) == 0:` is the first thing that we would need to check about the line. We added this because we encountered an error while testing a version of the program without it. Another example is the `if data:` statement. We added this because we needed to skip saving the data during the first encounter with the header, but it would be difficult to explain to someone that hadn't gone through the iterative process with us. 

A lot of programming is non-sequential like this. Reading the final solution from start to finish requires explanations about things that happen later. This is why it is VERY DIFFICULT to write a multi-step function or for loop correctly on the first try. We are used to thinking sequentially. To develop your programming intuition, you need to break problems down and iteratively build up the behavior you need.