# Manipulating Files and Processing Text <a name='home' />

We have learned about different data types, some things you can do with these data types, and some of the basic structure (conditionals and loops) that helps us write useful code in Lessons 1-3. We have also learned about useful modules (e.g. **pandas**, **matplotlib**, **numpy**, **cartopy**) for working with large datasets and visualizing graphs/maps. Today, we will look at some tools for processing text in Python, and learn how to read and write to files more generally. This is particularly useful if the data files you are using are not in a structure that easily translates to DataFrames, i.e. not in a clean, tabular format. At the end, we will share a few final thoughts on organizing your code and minimizing/troubleshooting errors.  


## Topics
- <a href=#bookmark1>1. Useful *methods* for text processing</a> 
- <a href=#bookmark2>2. Reading and Writing to Filehandles</a> 
- <a href=#bookmark3>3. Smart Practices with Python</a> 
- <a href=#bookmark4>4. Last Exercise?</a> 

## 1. Useful *methods* for text processing <a name='bookmark1' />
Systematically manipulating large text files is one of the most common tasks you will encounter. The most basic tools for this task are the built-in Python string methods. These allow us to convert between strings and lists, test the properties of strings, and modify strings.

A reminder that many types of objects have special built-in functions. We call these endemic functions *methods*, and in a broader discussion of object-oriented programming practice and theory, we would have much, much more to say about them. However, we're not getting into the object-oriented universe or philosophy here, so you'll have to take as explanation simply that some objects are so routinely manipulated with the same sorts of operations that it pays to have functions dedicated to their processing. In the case of strings and files today, we'll see the *methods* that routinely operate on these types.

Whereas a *function* is written to accept *arguments* and process those *arguments*, a *method* processes the object to which it belongs and is *called* differently. Whereas a *function* such as `print` is called by typing `print("whatever you want")`, etc, a *method* is called by typing a period and the name of the *method* at the end of the object. For example, if *print* were a *method*, it would be called like this: **string_variable.print()**. Notice that there are still **()** at the end of the name of the *method*, and *methods* can accept *arguments* just like *functions*. As a reminder, we've already seen many methods, e.g. the list *methods* **append()** and **extend()** in previous lessons. 

Below, we'll cover the following:

- Basic text processing with **split**, **partition**, and **join**
- Text testing with **endswith()**, **startswith()**, **find()**, and **in**
- Text conversion with **swapcase()**, **replace()**, **upper()**, and **lower()**

### Basic Text Processing (**split**, **partition**, and **join**)

Let's consider the task of converting a character string of a sentence into a list of words separated by spaces and punctuation marks below. **split** is the most common method for doing so, and it has high flexibility to make lists from a string using a variety of delimiters. See the examples below. Note that I've put the result in comments below each `print` function, but you can also run the cell and see the result.

In [None]:
sentence_string = "I am a well-written sentence, and so I dependably have punctuation. "
print (sentence_string.split(","))
##['I am a well-written sentence', ' and so I dependably have punctuation. ']

print (sentence_string.split())
##['I', 'am', 'a', 'well-written', 'sentence,', 'and', 'so', 'I', 'dependably', 'have', 'punctuation.']

print (sentence_string.split(" ",2))
##['I', 'am', 'a well-written sentence, and so I dependably have punctuation. ']

print (sentence_string.split("t"))
##['I am a well-wri', '', 'en sen', 'ence, and so I dependably have punc', 'ua', 'ion. ']

The **split** method is applied to strings. It takes a delimiter (we used a "," in the first example above) as an argument and breaks the string into a list of the substrings found between delimiters, leaving the delimiters themselves out. Anything can be a delimiter, so long as it is a string. If you don't specify a delimiter, the default is whitespace (see the second example above). This is convenient, as it splits on any run of whitespace of any size (spaces, tabs, or line breaks) and ignores whitespace at the start and end of the string. 

In [None]:
somuchwhitespace = "Here   is\t so much      white space,  but    the\ndefault \t still   works!"
print ("the string")
print (somuchwhitespace)
print ()
print ("the split string")
print (somuchwhitespace.split())


There is an optional second argument on the **split** method, which is an integer number **x**. This argument specifies the maximum number of times you want to split the string. After splitting **x** times, the rest of the string is kept intact, even if the delimiter is present.

When an explicit delimiter occurs more than once in succession, the gap between the consecutive delimiters becomes an empty string in the list (see the fourth example above, where the "tt" in "written" produces `''`). Thus, it's useful that the default is all whitespace--otherwise different files with different amounts of whitespace between data entries could get very frustrating very quickly!
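
To make the difference between an explicit delimiter and the default concrete, here is a minimal sketch. The strings are made up for illustration and are not taken from any lesson file.

```python
# Made-up sample strings, chosen only to illustrate the behaviour
csv_row = "sample1,,42"
spaced_row = "sample1" + " " * 5 + "42"   # five spaces between the two entries

# An explicit delimiter keeps the empty string between two adjacent commas
explicit_split = csv_row.split(",")
print (explicit_split)             # ['sample1', '', '42']

# The default (no argument) treats any run of whitespace as one delimiter
default_split = spaced_row.split()
print (default_split)              # ['sample1', '42']

# Splitting on a single space instead gives an empty string for each extra space
single_space_split = spaced_row.split(" ")
print (len(single_space_split))    # 6
```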

### Knowledge Check 1.1

Using the above `somuchwhitespace` string, try to recreate the following two lists - what character is being split on? Is it split every time?

```python
somuchwhitespace = "Here   is\t so much      white space,  but    the\ndefault \t still   works!"

['Here   is\t so much      white space,  but    the\ndefau', 't \t sti', '', '   works!'] #List A
['H', 'r', '   is\t so much      white space,  but    the\ndefault \t still   works!'] #List B
```

Using the function `list` on a string will turn it into a list of all the characters in the string.

Two more splitting methods are **splitlines**, which splits only on line breaks (such as "\n"), and **rsplit**, which, when given a second argument, starts splitting from the end of the string (the resulting list is still in the original string order).
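
As a quick, hedged illustration of these three tools side by side, the sketch below uses a short two-line string made up for this example (it is not read from a file).

```python
record = "I0061 U Karelia\nLoschbour M Loschbour"   # made-up two-line string

# splitlines breaks the string at the line break and drops the "\n"
lines = record.splitlines()
print (lines)            # ['I0061 U Karelia', 'Loschbour M Loschbour']

# rsplit with a second argument starts splitting from the right-hand end
from_right = lines[0].rsplit(" ", 1)
print (from_right)       # ['I0061 U', 'Karelia']

# split with the same arguments starts from the left-hand end
from_left = lines[0].split(" ", 1)
print (from_left)        # ['I0061', 'U Karelia']

# list turns a string into a list of its individual characters
letters = list("ATG")
print (letters)          # ['A', 'T', 'G']
```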


**partition(delimiter)** is a method that acts like **split(delimiter,1)**, but unlike **split**, the delimiter is kept in the output. It always returns a ***tuple*** of length three: the text before the delimiter, the delimiter itself, and the text after it. 

```python
somuchwhitespace = "Here   is\t so much      white space,  but    the\ndefault \t still   works!"
print (somuchwhitespace.partition('\t'))
>> ('Here   is', '\t', ' so much      white space,  but    the\ndefault \t still   works!')
```

### Knowledge Check 1.2
For each of the following, predict what would occur. Then, try removing the argument or adding a new argument in lieu of a comma. Predict what would happen as a result for each of the four cells below. Then, run the cells and see if your prediction matches. 

In [None]:
delimiter = "," 
sentence_string = "I am a well-written sentence, and so I dependably have punctuation. "
list_from_string = sentence_string.split(delimiter) 

print (list_from_string)

## Additional practice:
## Try removing the argument or adding a new argument in lieu of a comma. 
## Predict what would happen in each case.


In [None]:
delimiter=" "
sentence_string = "I am a well-written sentence, and so I dependably have punctuation. "
list_from_string = sentence_string.rsplit(delimiter,2) 
print (list_from_string)

In [None]:
somuchwhitespace = "Here   is\t so much      white space,  but    the\ndefault \t still   works!"
print (somuchwhitespace.partition('z'))

In [None]:
somuchwhitespace = "Here   is\t so much      white space,  but    the\ndefault \t still   works!"
list_from_string = somuchwhitespace.splitlines() 
print (list_from_string)

The method with the opposite function from **split** is **join**, which turns a list into a string. It has the form:

```python
seq = ["A","A","G","G","G","C","A","T","T","C","C"]
print ("-".join(seq))
>> "A-A-G-G-G-C-A-T-T-C-C"
```

Some key things to remember are that your delimiter goes at the beginning, and that the list being joined must contain only strings. 
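
The "only strings" requirement is easy to trip over when a list holds numbers. The sketch below uses a made-up list of integers and converts each item with `str` before joining; this is one common approach, not the only one.

```python
positions = [12, 48, 301]    # made-up list of integers

# "-".join(positions) would raise a TypeError, because the items are not strings.
# Converting each item to a string first avoids the problem.
as_strings = [str(p) for p in positions]
joined = "-".join(as_strings)
print (joined)               # 12-48-301
```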

In [None]:
seq = ["A","A","G","G","G","C","A","T","T","C","C"]
print ("-".join(seq))

### Knowledge Check 1.3


A. Using the `seq` list, create the three strings indicated by the '>>'
```python
seq = ["A","A","G","G","G","C","A","T","T","C","C"]
>> 'AAGGGCATTCC'
>> 'A A G G G C A T T C C'
>> 'A + A + G + G + G + C + A + T + T + C + C'
```
B. How would you create the list shown after '>>' from `seq` using both the **join**  and **split** methods?

```python
seq = ["A","A","G","G","G","C","A","T","T","C","C"]
>> ['A-A-', '-', '-', '-C-A-T-T-C-C']
```

C. Is **join** a `list` or `string` method? How can you tell?

### Testing your text
There are multiple other methods associated with strings, including **startswith**, **endswith** and **find**. Below is an example of each.

In [None]:
seq = "ATGGGCATTAG"
print (seq.startswith("ATG"))
print (seq.endswith("TAG"))
print (seq.find('GCA'))

**startswith** asks if the argument begins the string. **endswith** asks if the argument ends the string. **find** returns the index where the string is found. 

1. What happens when the argument in **find** appears more than once? 
2. What happens when the argument is not in the string?

Use the following example below to see what occurs.

In [None]:
seq = "ATGGGCATTAG"
print (seq.find("AT"))
print (seq.find("notinstring"))

1 - What happens when the argument in find appears more than once? 
    **It looks like it only shows the first instance of the string.**
    
2 - What happens when the argument is not in the string?
    **It looks like -1 is returned.**
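
If you need every position rather than just the first, **find** accepts an optional second argument giving the index at which to start searching. The sketch below is one possible way to collect all matches; the variable names are made up for this example. Checking for -1 matters, because -1 is also a valid index (the last character), so using it in a slice without checking gives a misleading result instead of an error.

```python
seq = "ATGGGCATTAG"
target = "AT"

positions = []
start = seq.find(target)
while start != -1:                         # -1 means there is no further match
    positions.append(start)
    start = seq.find(target, start + 1)    # resume just past the last match

print (positions)    # [0, 6]
```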
    

We have used `in` before on lists and dictionaries, but we can also use it to check if a substring is within a string. 

See the example below, where we look for the beginning and end of a protein-coding region of DNA. To do this, we look for where the start of protein coding occurs (with the code 'ATG') and where the end of protein coding occurs (with the code 'TAG' - note that there are more stop codes, but we're using one for simplicity). 

In [None]:
seq = "AAAAGGAATGGGCATTAGTTAGGGGG"
if 'ATG' in seq and 'TAG' in seq: 
    print ("There is a gene in this sequence!")
    beginind = seq.find('ATG')
    endind   = seq.find('TAG')
    print (seq[beginind:endind+3] )

The script above looks for the first index matching 'ATG' and 'TAG', and records those. Then, it uses slicing to cut out a section of the DNA sequence that corresponds. 

Technically, this method is a bit too simple - those of you who know some molecular biology will recognize that the final DNA sequence provided doesn't make sense for a protein-coding sequence - the number of bases should be divisible by three! For those interested, a fun Python challenge is writing a more complicated piece of code that will spit out all possible protein-coding sequences. If anyone is interested in this, Dr. Yang is happy to share a problem focused on this!
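
Another simplification in the gene-finding script is that it assumes the first 'TAG' comes after the first 'ATG'. The sketch below shows one way to guard against that, using the optional start argument of **find**. The sequence is made up so that a 'TAG' appears before the 'ATG'. It does not address the divisible-by-three issue described above.

```python
seq = "TAGAAATGGGCATTAGTT"   # made-up sequence with a 'TAG' before the 'ATG'

beginind = seq.find('ATG')
if beginind != -1:
    # Only look for 'TAG' after the start code
    endind = seq.find('TAG', beginind + 3)
else:
    endind = -1

if beginind != -1 and endind != -1:
    gene = seq[beginind:endind + 3]
    print (gene)             # ATGGGCATTAG
else:
    gene = None
    print ("No start/stop pair found in that order.")
```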

### Text Conversions
Systematically replacing the instances of a substring with a replacement substring may be a familiar task of tedium. Python has several methods for systematically converting characters in strings. The most general is the method **replace()**.

```python
oldcityname = "Peking"
newcityname = oldcityname.replace("Pek","Beij")
print ('old', oldcityname)
print ('new', newcityname)

>> 'old Peking'
>> 'new Beijing'
```

Notice that **replace()** does not change the string in place. Remember that strings are immutable, so you have to reassign the variable to refer to the new string object that **replace** returns.
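
A small sketch of what "not in place" means in practice, using a DNA-to-RNA conversion as a made-up example. It also shows the optional third argument of **replace**, which limits the number of replacements.

```python
dna = "ATGGGCATTAG"

dna.replace("T", "U")          # returns a new string, which is thrown away here
print (dna)                    # ATGGGCATTAG (unchanged)

rna = dna.replace("T", "U")    # keep the result by assigning it to a name
print (rna)                    # AUGGGCAUUAG

# An optional third argument limits how many replacements are made
first_only = dna.replace("T", "U", 1)
print (first_only)             # AUGGGCATTAG
```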

Since Python is case sensitive, as are most programming languages you'll be interested in using, you may also find yourself wishing that all the text in your data was the same case. There are methods for both testing and converting cases. These include **upper** and **lower**, which convert all letters to upper or lower case, respectively. **isupper** and **islower** ask if all the letters are upper or lower case. **swapcase** switches each letter from lower to upper or upper to lower, depending on what is already present. **isalpha** checks if all characters in the string are letters. 
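
One common use of these methods is comparing text whose capitalization may differ. A minimal sketch, with two made-up spellings of the same name:

```python
name_a = "Vestonice16"
name_b = "VESTONICE16"

exact_match = (name_a == name_b)
print (exact_match)               # False, because the cases differ

# Converting both sides to the same case makes the comparison case-insensitive
same_ignoring_case = (name_a.lower() == name_b.lower())
print (same_ignoring_case)        # True
```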

Run the following cell, and check your understanding of how each method works. 

In [None]:
oldcityname = "Peking"
newcityname = oldcityname.replace("Pek","Beij")
print ("1. Example above")
print ('old', oldcityname)
print ('new', newcityname)
print()

print ('2. Other examples')
print ("True or False?")
print ('isupper()')
print (oldcityname.isupper())
print ()
print ('islower()')
print (oldcityname.islower())
print ()
print ("Change upper or lower case")
print ('upper()')
print (oldcityname.upper())
print ()
print ('lower()')
print (oldcityname.lower())
print ()
print ('swapcase()')
print (oldcityname.swapcase())
print ()
print ("Check if using letters")
print ('isalpha()')
print (oldcityname.isalpha())

### Knowledge Check 1.4
For the following string (`mylab1` or `mylab2`), make the following edits from A-D:
```python
mylab1="Yang Lab, Department of Biology, University of Richmond, Richmond, Virginia, USA"
mylab2="Spera Lab, Department of Geography, University of Richmond, Richmond, Virginia, USA"
```
A. Make all letters upper case

B. Switch the case of each letter

C. Replace 'University of Richmond' with 'UR', "USA" with "U.S.A.", and "Department of Biology/Geography" with "Best Department Ever"

D. Provide only the substring `"University of Richmond, Richmond, Virginia, USA"`

<a href=#home>Return to Top</a> 

## 2. Files and Filehandles <a name='bookmark2' />

Now that we can process text, all we need is... more text. And odds are, that text is going to come in the form of a file, so it's high time that we start using them. The main topics now are:

- Opening and closing filehandles
- Reading from the filehandle with **read()**, **readline()**, and **readlines()**
- Reading from the filehandle iterable
- Writing or appending to a file with **write()** and **writelines()**

### Opening Filehandles
A filehandle is an object that controls the stream of information between your program and a file stored somewhere on the computer. Filehandles are not filenames, and they are not the files themselves. Like variables, filehandles contain the address of the file on the hard drive or other storage media. But unlike variables, filehandles also keep track of your current read position in the file. Imagine your file is like a book in a library. The filehandle tells Python where that book is, and keeps a bookmark in the book for where you currently are. Because filehandles are not the files themselves, deleting a filehandle in your script using the `del` statement does nothing to the file that handle refers to.

Let's try to open the file containing information on some ancient individuals, ***"51.2.2M.ind"*** in the resources/ folder. 

To open files, we use the function `open`, which takes as argument a string that is the path to the file. If you do not have a file path, it automatically looks in the directory from which the script is being called. 

In general, it is good practice to use absolute path nomenclature (e.g. /Users/myang/some_file or /scratch/myang_shared/lab/PythonBootcamp/Sp24/resources/some_file), but you can also place the file you want in the same directory as your program and not use any path. I often prefer to set up the file paths in a variable, so I don't have to rewrite the file path over and over if I have to use multiple different files from the same path.


In [None]:
pD = "/scratch/myang_shared/lab/PythonBootcamp/Sp24/resources/" #Put your path to your resources/ directory
myfile = open(pD+"51.2.2M.ind",'r')
print (myfile)
myfile.close()

Above, the output of `myfile` is an object that points to the file you want to read in. We used 'r' (or mode='r') as a second argument to indicate we want to read the file. Later, we will introduce some other options for the second argument. At the end, I put the command `myfile.close()`, which is a *method* that closes the file after opening it.

However, none of the above allows us to actually look at the lines in the file. For this, we need to use methods **readline**, **readlines** or a `for loop`. In the cell above, try adding the following line somewhere between the `open` function and the `myfile.close()` command. You should then get the resulting output shown here. To convince yourself this is the first line in the ***51.2.2M.ind*** file, open the IND file to look at the first few lines of the file.
```python
print (myfile.readline())
>> '              Bichon M     Bichon\n'
```

By using the method **readline**, we retrieved a string of the first line of the file, all white spaces and line breaks included.  However, if we continued forwards, adding a second `print (myfile.readline())` , we would find that even though we didn't assign anything anywhere, what was printed was the second line.

```python
>> '                 KK1 M     Kotias\n'
```

The printed line is now the second line in the file. The filehandle object has a memory of what has already been retrieved from the file. This would occur until you reach the end of the file, upon which **readline** would only give you an empty string.
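
To see this "bookmark" behaviour without depending on the course folder, the sketch below first builds a tiny stand-in file in a temporary directory (writing files is covered later in this lesson) and then reads it with **readline**. The file name and its two lines are made up for illustration.

```python
import os
import tempfile

# Build a small stand-in file so the sketch runs anywhere
path = os.path.join(tempfile.mkdtemp(), "tiny.ind")
demo = open(path, 'w')
demo.write("Bichon M Bichon\nKK1 M Kotias\n")
demo.close()

myfile = open(path, 'r')
first = myfile.readline()     # 'Bichon M Bichon\n'
second = myfile.readline()    # 'KK1 M Kotias\n'
at_end = myfile.readline()    # '' because nothing is left to read
myfile.close()

# repr shows the line breaks and the empty string explicitly
print (repr(first), repr(second), repr(at_end))
```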

In [None]:
##To test the above, first run this cell.
pD = "/scratch/myang_shared/lab/PythonBootcamp/Sp24/resources/" #Put your path to your resources/ directory
myfile = open(pD+"51.2.2M.ind",'r')

In [None]:
##Then, run this cell multiple times - note how the line keeps changing.
##Once all 51 lines have been read, each further run prints an empty string. 
print (myfile.readline())

One way to read all lines in at once is to use **readlines**.

**readlines** returns a list of strings, one for each line in the file, in order. Thus, with this method, we can get all the lines of a file into a list and retrieve any line as needed. However, because the filehandle remembers its position, calling **readlines** on a filehandle from which two lines have already been read returns a list that does not include those first two lines of the **.ind** file. The cell below opens a fresh filehandle, so its list contains every line:

In [None]:
pD = "/scratch/myang_shared/lab/PythonBootcamp/Sp24/resources/" #Put your path to your resources/ directory
myfile = open(pD+"51.2.2M.ind",'r')
x=myfile.readlines()
myfile.close()

print (x)

If we had instead called **readlines** on the earlier filehandle, right after the two **readline** calls, we would find:

```python
print (myfile.readlines())
>> ['                SATP M Satsurblia\n',
 '            Motala12 M   Motala12\n',
 '               I9030 M Villabruna\n',
 '               I0898 M Kostenki12\n',
 '               I0062 M Vestonice16\n',
 '               I0876 M Kostenki14\n',
 '        I0066.damage M    Pavlov1\n',
 '        I0909.damage F   Muierii2\n',
 '        I0004.damage M Vestonice13\n',
 '        I0080.damage M Vestonice15\n',
 '        I0065.damage M Vestonice43\n',
 '        I0889.damage M    Ostuni2\n',
 '        I0869.damage F    Ostuni1\n',
 '        I0878.damage F Continenza\n',
 '        I0006_damage M Vestonice14\n',
 '        I0907.damage F    ElMiron\n',
 '        I9050.damage F AfontovaGora3\n',
 '      Ranchot.damage F  Ranchot88\n',
 '    Bockstein.damage F  Bockstein\n',
 '        Ofnet.damage F      Ofnet\n',
 '       LCX-13.damage M LesCloseaux13\n',
 '        Berry_au_Bac M BerryAuBac\n',
 '              Q116-1 M GoyetQ116-1\n',
 '        Cioclovina_d M Cioclovina1\n',
 '                B1_d M Paglicci108\n',
 '             Q53-1_d M GoyetQ53-1\n',
 '           Q376-19_d M GoyetQ376-19\n',
 '            Q56-16_d M GoyetQ56-16\n',
 '            GA252snp M Paglicci133\n',
 '          Hohle_Fels M HohleFels79\n',
 '                HF49 M HohleFels49\n',
 '             Rigney2 F    Rigney1\n',
 '                  Q2 M   GoyetQ-2\n',
 '               BRI_d M Brillenhohle\n',
 '               BUR_d M Burkhardtshohle\n',
 '           Rochedane M  Rochedane\n',
 '               ADI_d M Iboussieres39\n',
 '       Falkenstein_d M Falkenstein\n',
 '             CRC-1_d M Chaudardes1\n',
 '               I1577 M   KremsWA3\n',
 '                 MA1 M     Malta1\n',
 '                 AG2 M AfontovaGora2\n',
 '                 KO1 M Hungarian.KO1\n',
 '            LaBrana1 M   LaBrana1\n',
 '         Leipzip_B_U M      Oase1\n',
 '               I0061 U    Karelia\n',
 '           Loschbour M  Loschbour\n',
 '           Ust_Ishim M   UstIshim\n',
 '           Stuttgart F  Stuttgart\n']
```
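
The same effect can be reproduced on a small scale. The sketch below writes a made-up three-line stand-in file to a temporary directory, then compares **readlines** on a partly read filehandle with **readlines** on a fresh one.

```python
import os
import tempfile

# Made-up three-line stand-in for the .ind file
path = os.path.join(tempfile.mkdtemp(), "tiny.ind")
demo = open(path, 'w')
demo.write("Bichon M Bichon\nKK1 M Kotias\nSATP M Satsurblia\n")
demo.close()

myfile = open(path, 'r')
myfile.readline()             # moves the read position past the first line
rest = myfile.readlines()     # only the lines that have not been read yet
myfile.close()
print (rest)                  # ['KK1 M Kotias\n', 'SATP M Satsurblia\n']

myfile = open(path, 'r')      # a fresh filehandle starts at the top again
everything = myfile.readlines()
myfile.close()
print (len(everything))       # 3
```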



1. Can you envision a reason you might want to be careful with **readlines**? What if the file size was 10 GB?

This is great if your file isn't too big, but one nice thing about the filehandle structure is that it is not very memory intensive. When you read a single line in, it 'forgets' the other lines, unless you stored them in other variables. Thus, you can quickly scan through a giant file, grabbing only the information you need. This saves memory. **readlines** will read all lines in, and thus won't help with saving memory. 

Another way of reading the lines in a file is through a `for loop`. Here, the dummy variable refers to each line until it reaches the end of the file. 

In [None]:
pD = "/scratch/myang_shared/lab/PythonBootcamp/Sp24/resources/"
myfile = open(pD+"51.2.2M.ind",'r')
for line in myfile: 
    print (line)
myfile.close()

Always remember to add the method **close** at the end when you are done with the file. 

This closes the file handle. While it often doesn't affect your script (you won't get an error if you accidentally forget to close the file), it is a useful habit to add this command, because until the file is closed in some form, it takes up system resources. While not a problem in smaller code, if you have a script that opens and uses millions of different files, you might take up too much of your system resources, slowing down or crashing your computer. 

Also in an interactive space like here - if you opened the file in an earlier cell and called a few lines, and then forgot to close, if you were to read lines in later cells, it would remember it was called earlier - this may be what you want, or it may make you only look at part of the file. 
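
Python also offers a `with` statement, which is not otherwise used in this lesson, that closes the filehandle for you when the indented block ends. The sketch below is an optional alternative to calling **close** yourself. It uses a made-up stand-in file in a temporary directory rather than the course file.

```python
import os
import tempfile

# Made-up stand-in file so the sketch runs anywhere
path = os.path.join(tempfile.mkdtemp(), "tiny.ind")
with open(path, 'w') as out:
    out.write("Bichon M Bichon\nKK1 M Kotias\n")

names = []
with open(path, 'r') as myfile:
    for line in myfile:
        names.append(line.split()[2])   # keep only the third column

# The filehandle was closed automatically when the with-block ended
print (myfile.closed)    # True
print (names)            # ['Bichon', 'Kotias']
```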

We now have many ways to read in the lines of a file into strings, but we still need to edit the string to retrieve useful information. Now, though, we can use the text processing tools we learned at the beginning of this lesson!

Below, I used **split** with the default to turn each string into a list of strings (in this case, a list of three strings for each column in the **ind** file). I then used the method **isalpha** to check if the third element, `x[2]` contains all letters. If it does, I print `x[2]`. If not, I do nothing and continue moving through the file. 

In [None]:
pD = "/scratch/myang_shared/lab/PythonBootcamp/Sp24/resources/"
myfile = open(pD+"51.2.2M.ind",'r')
for line in myfile:
    x = line.strip().split()
    if x[2].isalpha():   # True only when the name is made up entirely of letters
        print (x[2])
myfile.close()

### Knowledge Check 2.1
1. I introduced a new method **strip** above - look up what it does, either by using ?? or googling or both. What is it doing here, that changes the output compared to the cell above? Try printing `line.strip()` and just `line` by itself if you're not sure.
2. Edit the script above so you print the names (third column) of all individuals who are female (identified by 'F' in the second column). 

### Writing to Files
Now that you did all this work to extract the parts of the data you want, you might not want all your work to go to waste by remaining within the script. Perhaps you want to take the set of data and use it in another program, or put it into a nice format to stick into a presentation! Thus, the next step is to write to file. Luckily this is very easy and uses the same `open` function as reading files does. 

Remember when we used 'r' as the second argument in `open`? Well, if we use 'w' instead, we tell Python to create a new file with the file path and name given in the first argument. Then, anytime we use the method **write** on the filehandle, we will write into that new text file.  

In [None]:
pD = "/scratch/myang_shared/lab/PythonBootcamp/Sp24/students/mel/" 
##The above writes into my working folder - you'll want to edit to write into your folder!
newfile = open(pD+"Lesson5_write2file.txt",'w')
newfile.write("Woah, are we making a new file from scratch?\nYes we are!")
newfile.close()

print ("Check your folder using Linux commands and see if there's a new file!")

Note the important role "\n" is playing - what would happen if we just used regular spaces there?

You have to tell Python every single thing you want in the file, including line breaks.

**writelines** takes a list of strings and writes these to file. See the following.

In [None]:
newfile = open(pD+"Lesson5_write2file.txt",'w')
newfile.writelines(["Woah, are we making a new file from scratch?","Yes we are!","Now we are using writelines!","What is wrong here?!"])
newfile.close()

If you look inside your file, the four sentences are now all on the same line. To fix this, add `\n` to the end of each string. 
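
One way to add the line breaks is sketched below. To avoid touching your own folder, it writes to a temporary directory, which is an assumption of this example. Swap in your own `pD` path if you want to inspect the file yourself.

```python
import os
import tempfile

sentences = ["Woah, are we making a new file from scratch?",
             "Yes we are!",
             "Now we are using writelines!"]

# Assumed output location: a temporary folder, so nothing of yours is overwritten
path = os.path.join(tempfile.mkdtemp(), "Lesson5_write2file.txt")

newfile = open(path, 'w')
newfile.writelines([s + "\n" for s in sentences])   # add a line break to each string
newfile.close()

# Read the file back to confirm each sentence is on its own line
check = open(path, 'r')
written = check.readlines()
check.close()
print (len(written))    # 3
```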

Note that we overwrote the original `Lesson5_write2file.txt`. This is because when we use the 'w' (write) option in the `open` function, it automatically overwrites any existing file with the same name! The old contents are permanently deleted, and they won't be in your Trash or Recycling Bin to retrieve. 

***This is one of the easiest and most devastating bugs you can make, so always be very aware of when you are using 'w' vs. 'r' to avoid erasing a previous file you meant to keep!***
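
One possible safeguard is to check whether the file already exists before opening it with 'w'. The sketch below uses `os.path.exists` for this. The file name is made up, and the file lives in a temporary directory so the example is safe to run.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "results.txt")   # made-up file name

# Try to write the same file twice; only the first attempt should succeed
for attempt in [1, 2]:
    if os.path.exists(path):
        print ("Attempt", attempt, ": file exists, refusing to overwrite")
    else:
        newfile = open(path, 'w')
        newfile.write("written on attempt " + str(attempt) + "\n")
        newfile.close()
        print ("Attempt", attempt, ": file written")

check = open(path, 'r')
contents = check.read()
check.close()
print (contents)    # written on attempt 1
```

Python's `open` also accepts an 'x' mode, which raises an error rather than overwriting when the file already exists.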

Another option in `open` is the 'append' or 'a' option. This allows you to add to the end of a file. 

In [None]:
newfile = open(pD+"Lesson5_write2file.txt",'a')
newfile.write("\nNow I am adding a line to the end!\n")
newfile.close()

Now, with the 'a' option, I've added directly to the end of the file instead of overwriting the old one. 

I generally do not use the 'append' option, as I would rather write things to a new file and concatenate using the Linux command, to make sure I don't accidentally add things I do not want to older files. Consider if you ran the above code five times (give it a try!) - each time you'd keep adding to your original file, which can quickly make things build up unintentionally. 

Lastly, a note on **close** and writing files. When writing files, Python may not make the changes you stipulate right away, so if you plan to evaluate the contents of the file you're writing in the same script (or, for instance, use that file for something else during the run of that script), it is wise to close the filehandle to ensure that all the write operations you've requested are performed. Python will close any files at the end of the program's execution, so in most cases this is unlikely to be a problem, but in situations like those described above, it can be. Thus, I again encourage ending your work with any file you open by calling the **close** method. 

While we're on the subject, it is almost never a good idea to write to a file then read from it in the same script. When your data is in the form of Python objects those objects are stored in memory, and accessing data stored in *memory is 6 to 100,000 times faster than a hard disk*. We have not talked much about errors or code organization yet, but as your scripts get more advanced and you are dealing with larger text files, the memory your script uses and speed it runs is incredibly important, so it is good to start thinking of how to write code that uses little memory and is efficient. Effective troubleshooting for low memory, high speed code is super useful to coders!

### Knowledge Check 2.2

As a quick check, can you take the **51.2.2M.ind** file in the `resources/` folder, and then use Python to make a new file containing the same information, in all upper case, with only tabs separating between columns, with no extra white space at the beginning? Name the new file **51.2.2M.edited.ind**. Note that this utilizes skills from the first section on methods for text processing. 

The beginning of the file would look like this:
```
BICHON  M       BICHON
KK1     M       KOTIAS
SATP    M       SATSURBLIA
MOTALA12        M       MOTALA12
I9030   M       VILLABRUNA
I0898   M       KOSTENKI12
I0062   M       VESTONICE16
I0876   M       KOSTENKI14
I0066.DAMAGE    M       PAVLOV1
I0909.DAMAGE    F       MUIERII2
```

<a href=#home>Return to Top</a> 

## 3. Smart Practices with Python  <a name='bookmark3' />

At this point, you have most of the groundwork you need to write basic scripts, and much of the rest is practicing and building up your knowledge of different functions and modules until the logic and vocabulary becomes second nature. 

This last section is some parting thoughts to keep in mind as you continue to code!

### 3.1 On Errors

"Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." – Brian W. Kernighan

Let's be honest, most of us aren't perfect. And in our zen self-awareness, we are well-served to be sensitive to our predispositions to err in particular ways. We are probably even better served to listen to the computer when it tells us we've made a big mistake. Python can be quite good at communicating our flaws to us, if only we're receptive to the constructive criticism in its spartanly high-contrast black and white print.

This morning we're going to look at common sources of error, see what to look for as feedback from Python, and learn a couple of tricks to either obviate or track down bugs, should they occur.

If you're very lucky, an error will cause Python to give up right away, while others will cause insidious bugs that sneak through unnoticed until you present your work in lab meeting and someone calls you out on your exciting, but seemingly impossible (and ultimately bogus) result.

As a reminder, the errors that don't give an immediate error are called **logical errors**, while those that give immediate errors are called **syntax errors** or **runtime errors**. [This link](https://rollbar.com/blog/python-errors-and-how-to-handle-them/) gives a more in-depth run-down. You've already had to deal with these, by the very fact that you have been coding for the last five lessons. Here, we will talk about them explicitly. 

**Logical errors** are the hardest to fix, as you have no aid as you're executing the script. Your code worked, but perhaps not as you intended. Thus, the problem is *logical* - you need to figure out why what your brain is thinking is not matching what the computer is 'thinking'. Logical errors are found by going through the script and making sure each part is printing what you think it should be printing, and they are easier to fix as you gain better intuition about how different data types and functions work. However, even then it can be difficult to find the dumb slicing or data typing mistake. There are some strategies that, if you always employ them, will minimize the number of errors or at least help you catch them early in writing code. 

- Strategic Initiative 1: Test Early, Test Often
- Strategic Initiative 2: Be Verbose
- Strategic Initiative 3: Be Boring, Be Obvious

### 3.2 Strategic Initiative 1: Test Early, Test Often

You can save yourself lots of time by testing frequently. Debugging 100 lines of code is often more than 10 times as hard as debugging 10 lines. Writing a ton of code that generates output without checking if each component works individually does NOT make you a coding rock star; it makes you sloppy. Organize your code into sections, where you have a sub-goal for each section. For instance, you could have a section for inputting data into a variable from a file, a section for subsetting the data for what you want, and then a section for visualizing your final data. Checking that each subsection works will be easier than making no subsections and looking across all your code.

### 3.3 Strategic Initiative 2: Be Verbose

"Errors should never pass silently." -- The Zen of Python, by Tim Peters

One of the easiest ways to debug code is to print out the value of variables at the point things start going wrong. Of course, if you knew where things were going wrong, you would probably know what was going wrong in the first place, so a divide-and-conquer approach is often the fastest way to find that point. Start out by putting a bunch of distinguishable `print` statements throughout your code, then narrow things down gradually until you're right at the broken line of code. Having subsections of code from Strategic Initiative 1 gives you great places to add `print` statements, to check if each subsection is working correctly. 

It's not a bad idea to include these `print` statements in moderation before you need them. If you think there will be a subset of your data that will fail a logical test, set up your **if** and **else** statements to report the incidence of failure.

For instance, in nested loops, I will often print the variables belonging to each `for` loop, making sure that what is printed out is what I expect. 
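Here is one way this might look in practice, as a sketch rather than a recipe. The list `records` is hypothetical data standing in for values read from a file, and the labels in square brackets are just a convention to make each `print` statement easy to tell apart. The numeric test uses the string methods `replace` and `isdigit`, and it is deliberately simple (it would reject negative numbers, for example).

```python
records = ["12.5", "n/a", "7.0", "", "3.25"]  # hypothetical values from a file

values = []
n_failed = 0
for i, text in enumerate(records):
    # Report what the loop is actually seeing on each pass
    print(f"[loop] i={i} text={text!r}")
    # Remove at most one decimal point, then check that only digits remain
    if text.replace(".", "", 1).isdigit():
        values.append(float(text))
    else:
        # Report the failure instead of skipping it silently
        n_failed += 1
        print(f"[skip] record {i} is not a number: {text!r}")

print(f"[done] kept {len(values)} values, skipped {n_failed}")
```

Once the code is behaving, the `[loop]` line can be commented out, while the `[skip]` and `[done]` lines are often worth keeping.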

### 3.4 Strategic Initiative 3: Be Boring, Be Obvious

"Programs must be written for people to read, and only incidentally for machines to execute." --Structure and Interpretation of Computer Programs Hal Abelson and Gerald Jay Sussman

As you get better at coding, you will start to take shortcuts and combine lines. As soon as something doesn't behave as you expect, you should decompose your compound statements (e.g. list comprehensions), as this is a common source of error. 
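To illustrate, here is a compact list comprehension next to a decomposed version of the same logic. The tab-separated `lines` and the cutoff of 100 are invented for this example.

```python
lines = ["gene1\t120\n", "gene2\t95\n", "gene3\t240\n"]  # hypothetical input lines

# Compact version: works, but there is nowhere to look inside it
long_genes = [line.strip().split("\t")[0] for line in lines
              if int(line.strip().split("\t")[1]) > 100]

# Decomposed version: every intermediate value has a name you can print
long_genes_verbose = []
for line in lines:
    fields = line.strip().split("\t")  # e.g. ['gene1', '120']
    name = fields[0]
    length = int(fields[1])
    if length > 100:
        long_genes_verbose.append(name)

print(long_genes == long_genes_verbose)
```

Both versions give the same list, but only the second lets you drop a `print(fields)` in the middle when something looks off.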

It's always important to comment your code. The [Python Style Guide](https://peps.python.org/pep-0008/#comments) has some good recommendations about comments. Anything you did that required some thinking on your part should ideally be commented, since you might come back to the code weeks or months later and have forgotten why you did it that way. Also try to keep your comments up to date with the code. While this is important if future coders are adapting your code for their own use, it's also important for you! The person you are six months from now will not remember a single thing about how you set up your code today. You will save future you so much work if you take a bit of time now to comment your code. 

### 3.5 Strategic Initiative 4: Start Small

For most of us taking this workshop, our ultimate goal is to work with large datasets. By now, you may have already noticed that some of these datasets take a lot of time to load or to process through our script. When you're troubleshooting, this can be a pain. If you're working with a large dataset, it's always a good idea to take a subset of the data to work with as you're writing the script. One way is to use `head` in Linux to grab the first N rows and save them as a smaller test datafile. This is often the fastest and easiest way, but note that the first N rows may not show all the variety in the full dataset, so grab enough rows that you feel you have a good representation. 

If you can get everything working on your small dataset, it's much more likely you'll get everything working on the large dataset, and it will be a lot faster to catch your errors. So take the time to start small!
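The same idea can be sketched in Python itself, assuming you would rather not leave the notebook. The example below first writes a stand-in "large" file to a temporary folder (the file name `big_data.csv` and its contents are made up), then uses `itertools.islice` to read only the first few data rows.

```python
import itertools
import os
import tempfile

# Build a stand-in "large" file so this sketch can run on its own
tmp_dir = tempfile.mkdtemp()
big_path = os.path.join(tmp_dir, "big_data.csv")
with open(big_path, "w") as out:
    out.write("day,max_temp\n")
    for day in range(1, 1001):
        out.write(f"{day},{20 + day % 10}\n")

# Read the header plus only the first n_rows data rows
n_rows = 5
with open(big_path) as infile:
    header = next(infile).strip()
    subset = [line.strip() for line in itertools.islice(infile, n_rows)]

print(header)
print(subset)
```

If you are loading the file with `pandas` instead, `pd.read_csv(path, nrows=5)` limits the read in a similar way.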

### 3.6 Strategic Initiative 5: The Computer is only as smart as you are.

This might be changing with ChatGPT and the like, but it's still important to remember as you write code: your code will never make sense and give you what you want if you don't understand the data and how they work. Before you do any coding, dig into the problem - understand what it's asking for, and if it involves a dataset, spend the effort to learn how the data are set up. 

If you are working with a small version of the dataset (Strategic Initiative 4), especially if you take a manageable set of 5-10 lines at first, you could probably draw or write out by hand, on a sheet of paper, what the end result of your code should be. Take a moment to do that - draw the final graph you want, or calculate the average you would expect to see, etc. That way, when you are writing your code, you can test that you get the expected output for your smaller dataset, and scale up from there to the larger dataset where you won't be able to do the calculations manually!
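A small sketch of that workflow, using invented numbers: the five rows in `small` are few enough to average by hand, so the hand-calculated answer goes into `expected` and the code's answer is compared against it.

```python
# Five hypothetical (month, max temperature) rows, small enough to check on paper
small = [("July", 30.0), ("July", 32.0),
         ("Aug", 29.0), ("Aug", 31.0), ("Aug", 33.0)]

# Worked out by hand: July = (30 + 32) / 2, Aug = (29 + 31 + 33) / 3
expected = {"July": 31.0, "Aug": 31.0}

totals = {}
counts = {}
for month, temp in small:
    totals[month] = totals.get(month, 0) + temp
    counts[month] = counts.get(month, 0) + 1

averages = {}
for month in totals:
    averages[month] = totals[month] / counts[month]

print(averages)
print("matches hand calculation:", averages == expected)
```

Only once the small case matches is it worth pointing the same code at the full dataset.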

Don't forget Strategic Initiative 2 - be verbose! Every time (or at least a lot of the time) you use a datafile, start a `for` loop, or initialize a variable, take a moment to comment, or fix in your brain, what you expect the result to be.

<a href=#home>Return to Top</a> 

## 4. Last Exercise?  <a name='bookmark4' />

In lieu of a week of exercises, we thought the knowledge checks above, plus 1-2 more exercises here, would be enough. 

### Exercise 1
One common file for storing DNA sequences is a FASTA file. For more of a description, see [this link](https://compgenomr.github.io/book/fasta-and-fastq-formats.html). But basically, you have a header line marked by '>' for one entry, followed by one or more rows showing the corresponding DNA sequence for that entry. 

For example, the FASTA file below has two entries, for 'gene1' and 'gene2', and the bases following each header indicate the corresponding DNA sequence. 'gene1' has the sequence `ATGAGACGTAGTGCCAGTAGCGCGATGTAGCGATGACGCATGACGCGCGACGCGCGAGTGAGCCATACGCACGCATTGGCA` (the line breaks don't matter). 
```
>gene1
ATGAGACGTAGTGCCAGTAGCGCGATGTAGCG
ATGACGCATGACGCGCGACGCGCGAGTGAGCC
ATACGCACGCATTGGCA
>gene2
ATGTTCGACGCATACGACGCGCAGTACCAGCA
ATGACGCACCGGGATACACGACGCGGATTTTT
ACGCACCGAGATAGCATAAAAGACCATTAG
```
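To make the "line breaks don't matter" point concrete, here is a rough sketch that joins the sequence lines of the small example above into one string per entry. It works on a string pasted into the script rather than on the exercise files, and the names `fasta_text`, `sequences`, and `current` are just illustrative choices.

```python
fasta_text = """>gene1
ATGAGACGTAGTGCCAGTAGCGCGATGTAGCG
ATGACGCATGACGCGCGACGCGCGAGTGAGCC
ATACGCACGCATTGGCA
>gene2
ATGTTCGACGCATACGACGCGCAGTACCAGCA
ATGACGCACCGGGATACACGACGCGGATTTTT
ACGCACCGAGATAGCATAAAAGACCATTAG
"""

sequences = {}   # entry name -> full sequence
current = None   # name of the entry we are currently reading
for line in fasta_text.splitlines():
    line = line.strip()
    if line.startswith(">"):
        # Header line: start a new entry, dropping the '>' marker
        current = line[1:]
        sequences[current] = ""
    elif line and current is not None:
        # Sequence line: append it to the current entry
        sequences[current] += line

for name in sequences:
    print(name, len(sequences[name]))
```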

Now that we know this, let's look at two files - each with one entry for the mitochondrial DNA of a Neanderthal. Under `resources/`, you'll find `Mezmaiskaya.fasta` and `ElSidron.fasta`. Each DNA sequence has the same number of bases, such that the base in position 1 for Mezmaiskaya aligns to the base in position 1 for El Sidron. 

Write a script where you retrieve the DNA sequences for both files and compare them to each other. Determine (1) the total number of bases in each DNA sequence, and (2) the fraction of positions that share the same base (i.e. the percent similarity between these two DNA sequences). 

### Exercise 2

Now let's look at the `RVA_1939_present.csv` file that we worked with in Lesson 4. Remember that this provides the maximum and minimum temperatures for every day from Mar 1, 1939 to Feb 20, 2024. 

Determine the average maximum temperature for the month of July from 1939 to 2023. Use a for loop to print out this information, with the format "Year# - AvgMaxTemp# C".

Optional 1: If you're in the mood for visualizing - plot this information as a scatterplot using `matplotlib`. 

Optional 2: And if you really want to play around - determine the average maximum temperature for each month, and visualize them all on the same plot using different colors. Add a legend so we can tell which data correspond to which month. While this can be done with lists, you might find it useful to put the corresponding information in a `pandas` DataFrame or a `numpy` array. 

<a href=#home>Return to Top</a> 

**Okay, that's everything we have for this semester, in terms of teaching! We will assign mini-projects for you all to work on over the next two weeks. If you didn't get to all the problems, no sweat, but we may recommend good practice ones if you're feeling uncertain about the mini-project and want a few more exercises to work on first.**