# Summary of this notebook
---

## Key points
1. Everything in Python is an object with its own methods
2. We are going to introduce a **file object**
   * allows us to use, access, manipulate files
   * like all objects, has associated methods: `.read()`, `.write()`, `.close()` and others (the file object itself is created with the built-in function `open()`)

## Overview of this module
1. I/O Data
2. File Objects
    1. Paths - where are the files we create placed?
3. Rules of reading, writing, overwriting, appending to files
4. Creating plain text files


## Extra resource: 
* [CS50 in Python](https://cs50.harvard.edu/python/2022/weeks/6/) covers I/O

# I/O
---

A useful program usually takes data as input, manipulates it, and outputs the results. So far we have learned a few ways to manipulate strings.

In this module, we will learn how to open, read and write files - sometimes even multiple files at multiple locations. 

We will use "flat files" (i.e. files with no structured relationships between records, unlike a database). This means plain text files (.txt) or comma-separated files (.csv).

## Introductory Example: Creating and writing to a file

**Step 1:** Go to the folder icon on the left. Click on it and navigate to the "content" folder. Click on that. This is the default location of files that you create within a program. You should see that there are no text files there.

**Step 2:** What does the following program do? 

In [None]:
# The following is called a "List Comprehension" and it is a way of efficiently filling a list from a loop
# -------------------------------
my_list = [i**2 for i in range(1,11)]
print(my_list)

f = open("output_data/output.txt",  # creates a file object for this output file path
         "w+") # 'w+' mode: write, plus ('+') read; creates the file or empties an existing one

for item in my_list:
    f.write(str(item) + "\n")

f.close() # ends access to the file and flushes any buffered output to disk

# if you check, you will notice that there is now a file called 'output.txt' in the 'output_data' folder
# (the 'output_data' folder must already exist, otherwise open() raises FileNotFoundError)
# to determine what this program does, we can interrogate the code....
# of course, you can also just open the 'output.txt' file.

[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]


In [None]:
my_list2=["AT","CG","cg"]
# What happens if you change the mode in the next line? Try "w", "r" and "a", or leave the mode out,
# and see which ones allow you to create a new file
f = open("output_data/output_test_1.txt", "w+")
f.close() # close this file before reusing the name f for another file

#Does this provide a hint about what "w" is doing?
f = open("output_data/output_2.txt","w+")
# ---------------------
# CHECK YOUR FOLDER AND SEE IF THERE IS A NEWLY CREATED FILE CALLED output_2.txt
# ---------------------
for item in my_list2:
    f.write(item + "\n")
f.close()

## File Objects
A file object allows us to access and manipulate the data in a file.
* `open(file_address, access_mode)` is a built-in function that returns a file object. It takes two arguments: `file_address`, `access_mode`
    * access mode (sometimes called "file permissions"):
        1. r/r+  - cursor is at the beginning of the file; read only/read and write. The file must already exist and its contents are kept
        2. w/w+  - cursor is at the beginning of the file; write only/read and write. The file is created if needed and any existing contents are erased
        3. a/a+  - cursor is at the end of the file (for appending); append only/read and append. The file is created if needed
    * file object methods (not exhaustive, there are other methods associated with files):
        * `file_object_name.read(size)`: returns a string of up to `size` characters; if the size argument is omitted, it reads until **end of file (EOF)**
        * `file_object_name.readline(size)`: returns the next line as a string (until a newline or EOF is encountered)
        * `file_object_name.readlines(hint)`: returns a list of the remaining lines
        * `file_object_name.write(string)`: writes a string to the file
        * `file_object_name.close()`: closes the file; it cannot be read from or written to until it is opened again
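
A minimal sketch of how the access modes listed above behave. It assumes a throwaway file in a temporary directory (the names `demo_path` and `modes_demo.txt` are hypothetical and not part of the course files):

```python
import os
import tempfile

# Hypothetical scratch file so nothing in your course folders is touched
demo_path = os.path.join(tempfile.mkdtemp(), "modes_demo.txt")

# "w" creates the file (or empties an existing one); writing starts at position 0
f = open(demo_path, "w")
f.write("ACGT\n")
f.close()

# "r+" keeps the contents and starts at position 0, so writing overwrites in place
f = open(demo_path, "r+")
start_r_plus = f.tell()   # current cursor position
f.write("TT")
f.close()

# "a" writes at the end of the file
f = open(demo_path, "a")
f.write("GGCC\n")
f.close()

# "r" is read-only
f = open(demo_path, "r")
contents = f.read()
f.close()

print(start_r_plus)     # 0
print(repr(contents))   # 'TTGT\nGGCC\n'
```

The `r+` write replaced the first two characters rather than inserting before them, while the `a` write left everything already in the file alone.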

In [9]:
# How do we open a file?
#-------------------------------
# the format that we use is:
# variable_name = open(path_if_necessary_and_file_name,mode) 
# where mode = append, read, write
# ---------------------------

f = open("output_data/output.txt", "r+") # 'r+' keeps the existing contents; writing starts at the beginning of the file

# the command above opened the existing text file 'output.txt' in 'r+' mode. In an earlier cell, we created output.txt with 'w+'. What happens to its contents when we write to it now?
# open output.txt after running this cell to check
# the new file **object** is stored in the variable f, which now has **attributes and methods**
# instead of a bare file name such as "output.txt", we pass a path ("output_data/output.txt") that also says which folder the file is in

f.write("Hello JAX!"+"\n")
f.close()

### File objects vs. file names vs. file contents

The subtle differences between file name, file object and file contents:

**File object** <--Represents the file itself. You use methods like `.read()` on file objects and ONLY on file objects (you will get an `AttributeError` if you attempt to use the `.read()` method on something that isn't a file object)

**File name** <--String with the name of the file

**File content** <--String that stores the text of the file itself; returned by `.read()`
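
The three can be told apart by checking their types. This is a small sketch using a hypothetical scratch file (`example.txt` in a temporary directory), not one of the course files:

```python
import os
import tempfile

# File name: just a string (hypothetical path, for illustration only)
file_name = os.path.join(tempfile.mkdtemp(), "example.txt")

# File object: what open() returns; it has .write(), .read(), .close()
file_object = open(file_name, "w")
file_object.write("ATGC\n")
file_object.close()

file_object = open(file_name)        # default mode is "r"
file_contents = file_object.read()   # file contents: a string
file_object.close()

print(type(file_name).__name__)      # str
print(type(file_object).__name__)    # TextIOWrapper
print(type(file_contents).__name__)  # str

# A string has no .read() method, so this raises AttributeError
try:
    file_name.read()
except AttributeError as err:
    error_message = str(err)
    print("AttributeError:", error_message)
```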

In [11]:
f = open("output_data/dna.txt", "w")
#the command above created a text file called 'dna.txt' in 'w' mode. 
# you should be able to check that there is now a file called dna.txt
# -------------------
# since we used "w" and not "w+" as the access mode, an exception (io.UnsupportedOperation) is raised if we attempt to read the contents of the file
# as we do in the line below - currently commented out. If we changed the w to w+, the read would be allowed (but it would return an empty string here, because "w+" has just emptied the file).
# -------------------
#f.read()
f.write("ATTGTCCCGTu"+"\n")
f.close()
# ---------------------------------------
my_file_name = "dna.txt"
# Note: you can just create a PLAIN TEXT FILE with a random sequence in it or you can create the file using the technique above
print(my_file_name)

#my_file_pointer is the file object
my_file_pointer = open(f"output_data/{my_file_name}")
print(my_file_pointer)
#read my file contents
my_file_contents = my_file_pointer.read()
print(my_file_contents)
my_file_pointer.close()

dna.txt
<_io.TextIOWrapper name='output_data/dna.txt' mode='r' encoding='utf-8'>
ATTGTCCCGTu



In [14]:
# NOTE: THIS WILL BE DIFFERENT FOR GCP/HPC. The following is what happens when you are using local files,
# that is: running this notebook on your laptop rather than on the cloud. 
# ______________________________________
# my suggestion is that if you are using a file, put it in the same directory/folder as the Jupyter notebook that is accessing it.
# If not, you can always type the full path to the file; it is just a bit more complicated on the first step. 
# Note the file path will look different if you are on a PC.
# ------------------------------------
# The file path below is for my macOS machine. 
#my_file_test = "/Users/presgd/MyPythonFiles/dna.csv"
my_file_test="source_data/MusAgouti.txt"
# the same file path for Windows should look something like this
# (note the r prefix: in a normal string, "\U" starts an escape sequence and causes a SyntaxError):
#my_file_test = r"C:\Users\presgd\mypython\dna.csv"

print(my_file_test)
print("-----------")
# we are creating the file object my_file_test_pointer
my_file_test_pointer = open(my_file_test)
print(my_file_test_pointer)
print("~"*15)
# the unimaginatively named my_file_test_contents should hold the first 10 characters of what is 
# actually in your file
my_file_test_contents = my_file_test_pointer.read()[0:10]
print(my_file_test_contents)
print("*******")
my_file_test_pointer.close()

source_data/MusAgouti.txt
-----------
<_io.TextIOWrapper name='source_data/MusAgouti.txt' mode='r' encoding='utf-8'>
~~~~~~~~~~~~~~~
>NM_001271
*******


## Reading/Writing Text from Files

**The _major_ tasks that we want to be able to do when manipulating text files include:**

1. Manually double-check that the file exists *and that it is in the correct path*, **or** that you can create a file there
2. Open the file 
3. Read file (or write to the file, or a different file, or read and write to the file)
    * These are all **methods of the file object**
4. Close file
    * **method of the file object**

### Procedure 
1. Check file exists
2. Open file
3. Read file or _Append_ text to end of existing file or _Write_ to a file (opening a file in write mode ERASES its existing contents)
4. Close file
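
The existence check in step 1 can also be done in code. The sketch below is one possible way to do it, using `os.path.isfile` from the standard library. The helper name `read_if_exists` and the file names are hypothetical:

```python
import os
import tempfile

def read_if_exists(path):
    """Return the text of the file at path, or None if there is no such file."""
    # 1. Check that the file exists
    if not os.path.isfile(path):
        print("File not found:", path)
        return None
    # 2. Open, 3. read, 4. close
    f = open(path)
    text = f.read()
    f.close()
    return text

# Hypothetical scratch folder for the demonstration
folder = tempfile.mkdtemp()

missing = read_if_exists(os.path.join(folder, "missing.txt"))

present_path = os.path.join(folder, "present.txt")
f = open(present_path, "w")
f.write("ACGT\n")
f.close()
present = read_if_exists(present_path)

print(missing)         # None
print(repr(present))   # 'ACGT\n'
```

If a file you expect is reported missing, `os.getcwd()` shows which folder Python treats as the starting point for relative paths.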


### Writing to an output file
The steps involved in writing to a file: 

1. Open/create file and specify access (or Python will assume r and you won't be able to write to it)
2. Write to file:  `.write()`
3. Close file: `.close()` – ensures that output to a file is flushed

In [None]:
#remember the first example we gave and we asked what it did?
# Here it is again but this time you know what it does!
my_list = [i**2 for i in range(1,11)]

my_file = open("output.txt", "w")

for each in my_list:
    my_file.write(str(each)+"\n")

# After running this cell, you could try running it again with the following line hashed out. 
# When you open output.txt, what happens when the .close() is hashed out? 
my_file.close()

### Appending to a file
The steps involved in appending to a file: 

1. Open file and specify **append** access (or Python will assume r)
2. Write to file: `.write()`
3. Close file: `.close()`
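
A minimal sketch contrasting write and append access, assuming a hypothetical scratch file (`append_demo.txt` in a temporary directory):

```python
import os
import tempfile

log_path = os.path.join(tempfile.mkdtemp(), "append_demo.txt")  # hypothetical file

# Open the same file three times: twice with "w", then once with "a"
for mode in ["w", "w", "a"]:
    f = open(log_path, mode)
    f.write("opened with " + mode + "\n")
    f.close()

f = open(log_path)
lines = f.readlines()
f.close()

print(lines)   # ['opened with w\n', 'opened with a\n']
```

The second `"w"` erased what the first one wrote, so only two lines survive; the `"a"` added its line after the existing text.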

In [15]:
# List comprehension
my_list = [i**2 for i in range(1,11)]
# recreate the original output.txt file: 
# ---------------------------------
f = open("output_data/output.txt","w+")
for item in my_list:
    # convert each number to string since write takes string objects. 
    f.write(str(item) + "\n")
f.close()
# ---------------------------------
# let's use the permission a+. What does this do to our output? Run it and check the output.txt file
# After you have checked the file, change the a+ to r+ and see how that changes the output.txt file

my_file=open("output_data/output.txt","a+")
my_file.write("Hello JAX!"+"\n")
my_file.close()

### Reading from a file
1. Open file 
2. Read file 
   1. read one line at a time from the file. The first time it is called, it reads the first line; the second time, the second line, etc.: `.readline()`
   2. read all lines into a list, one string per line: `.readlines()`
   3. read the entire file contents as a single string: `.read()`
3. Close file
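
The reading methods can be compared side by side. This sketch assumes a hypothetical three-line scratch file (`three_lines.txt` in a temporary directory):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "three_lines.txt")  # hypothetical file
f = open(path, "w")
f.write("1\n4\n9\n")
f.close()

# readline() returns one line per call and remembers where it stopped
f = open(path)
first = f.readline()     # '1\n'
second = f.readline()    # '4\n'
rest = f.readlines()     # ['9\n'] - only the lines not yet read
at_end = f.readline()    # '' - an empty string signals end of file
f.close()

# read() returns the whole file as one string
f = open(path)
whole = f.read()         # '1\n4\n9\n'
f.close()

# readlines() on a freshly opened file returns every line in a list
f = open(path)
all_lines = f.readlines()   # ['1\n', '4\n', '9\n']
f.close()

print(repr(first), repr(second), rest, repr(at_end))
print(repr(whole))
print(all_lines)
```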

In [None]:
# read() with no argument reads all contents of the file at once
my_file=open("output.txt","r")
# here we print only the first 5 characters -- including the \n characters
print(my_file.read(5))
my_file.close()
print("------------")
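# readlines() reads all remaining lines at once and returns them as a list
# to read one line at a time, comment out readlines() and uncomment the readline() calls below;
# this shows the difference between the readlines() and readline() methods.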
my_file=open("output.txt","r")
print(my_file.readlines())
#print(my_file.readline())
#print(my_file.readline())
#print(my_file.readline())
my_file.close()

## [Rosalind Problem](https://rosalind.info/problems/ini5/)
**Problem 3B1**

**Given:** A file containing at most 1000 lines.

**Return:** A file containing all the even-numbered lines from the original file. Assume 1-based numbering of lines.

**Sample Dataset:**

> Bravely bold Sir Robin rode forth from Camelot

> Yes, brave Sir Robin turned about

> He was not afraid to die, O brave Sir Robin

> And gallantly he chickened out

> He was not at all afraid to be killed in nasty ways

> Bravely talking to his feet

> Brave, brave, brave, brave Sir Robin

> He beat a very brave retreat

**Sample Output:**

> Yes, brave Sir Robin turned about

> And gallantly he chickened out

> Bravely talking to his feet

> He beat a very brave retreat

In [None]:
f = open("source_data/Rosalind_3C1.txt", "r")
stuff = f.readlines()[1:] # drop line 1 (index 0); index 0 of stuff is now line 2, the first even-numbered line
f.close()

o = open("output_data/Rosalind_3C1_out.txt", "w+")
for i, line in enumerate(stuff):
    if i % 2 == 0:
        o.write(line)

o.close()

## [Rosalind Problem](https://rosalind.info/problems/gc/)

**Problem 3B2**

The GC-content of a DNA string is given by the percentage of symbols in the string that are 'C' or 'G'. For example, the GC-content of "AGCTATAG" is 37.5%. Note that the reverse complement of any DNA string has the same GC-content.

DNA strings must be labeled when they are consolidated into a database. A commonly used method of string labeling is called [FASTA](https://en.wikipedia.org/wiki/FASTA_format). In this format, the string is introduced by a line that begins with '>', followed by some labeling information. Subsequent lines contain the string itself; the first line to begin with '>' indicates the label of the next string.

In Rosalind's implementation, a string in FASTA format will be labeled by the ID "Rosalind_xxxx", where "xxxx" denotes a four-digit code between 0000 and 9999.

**Given:** At most 10 DNA strings in FASTA format (of length at most 1 kbp each).

**Return:** The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated; see the note on absolute error on the Rosalind problem page.

**Sample Dataset:**

\>Rosalind_6404

CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGG

\>Rosalind_5959

CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCTATATCCATTTGTCAGCAGACACGC

\>Rosalind_0808

CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGACTGGGAACCTGCGGGCAGTAGGTGGAAT

**Sample Output:**

Rosalind_0808

60.919540
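
As a starting point (not a full solution), the GC-content calculation from the problem statement could be sketched as follows. The function name `gc_content` is hypothetical, and the sketch assumes a non-empty, upper-case DNA string:

```python
def gc_content(dna):
    """Return the percentage of characters in dna that are 'G' or 'C'."""
    gc_count = dna.count("G") + dna.count("C")
    return 100 * gc_count / len(dna)

# The example from the problem statement: 3 of the 8 symbols are G or C
example = gc_content("AGCTATAG")
print(example)   # 37.5
```

Reading the FASTA records from a file and finding the ID with the highest GC-content is left as the exercise.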


## Extracting file contents: .rstrip()

Opening files can sometimes require something _extra_.

For instance, files can contain **hidden** characters such as **\n** (which means "start a new line") that need to be removed. A straightforward way to do this is to use the string method: **`.rstrip("\n")`**

There is also a separate method for removing hidden characters from the left side of the string called `.lstrip()`. However, that generally isn't relevant for I/O so I mention it merely for completeness.
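
A short sketch of the stripping methods on a made-up string (the value of `raw_line` is hypothetical). `repr()` is used so that the hidden characters show up when printed:

```python
raw_line = "  ACGT\n"   # two leading spaces and a trailing newline

right_newline = raw_line.rstrip("\n")   # removes only trailing newlines
left_stripped = raw_line.lstrip()       # removes leading whitespace
both_stripped = raw_line.strip()        # removes whitespace from both ends

print(repr(raw_line))        # '  ACGT\n'
print(repr(right_newline))   # '  ACGT'
print(repr(left_stripped))   # 'ACGT\n'
print(repr(both_stripped))   # 'ACGT'
print(len(raw_line), len(right_newline))   # 7 6
```

Called with no argument, `.rstrip()` removes all trailing whitespace (spaces, tabs and newlines), not just `\n`.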

In [30]:
#open the file that we created above. Notice: when we originally created dna.txt, we gave it access 'w', but
# when we open the file again, we can specify different access. Here we have not given any access mode so 
# the default of Python is to assume "r" as access mode
# -----------------------
my_file=open("output_data/dna.txt")

# #read contents - you could do this in two lines:
# # 1. read in files contents
# my_file_contents= my_file.read()
# # 2. put file contents into variable WITHOUT the .rstrip(), see what happens: 
# my_DNA=my_file_contents

# or 2. put file contents into variable WITH the .rstrip(), see what happens:
#my_DNA=my_file_contents.rstrip("\n")

# -----------------------------
#notice that you can BE MORE EFFICIENT AND append the .rstrip method to the read method.
# --------------------------
my_DNA = my_file.read().rstrip("\n")
#calculate length by using the built-in function len(). We will see this function A LOT. 
dna_length=len(my_DNA) # returns 12 w/out .rstrip("\n")
#print output
print("Sequence is "+my_DNA+" and the length is "+str(dna_length) +" bp")
my_file.close()

Sequence is ATTGTCCCGTu and the length is 11 bp


## Aside: Why do we need to *__close__* our files? 
Data is buffered during the I/O process which means it is held in a temporary location before being written to the file. 

Python flushes the buffer (writes the buffer to the file) when the buffer fills up, when you call `.flush()`, or when the file is closed. If you never close the file, some of your data may not be transferred to the file.

Digression for when you want to use Python in the wild: 
* Python can automatically close your files for you by using context managers (the `with` statement), which rely on two special methods: `__enter__()` and `__exit__()`
    * We won't discuss them in this course
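
For the curious, a minimal sketch of what this looks like in practice, using a hypothetical scratch file (`with_demo.txt` in a temporary directory):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "with_demo.txt")  # hypothetical file

# The file is closed automatically when the indented block ends
with open(path, "w") as out_file:
    out_file.write("Closed automatically.\n")

print(out_file.closed)   # True

with open(path) as in_file:
    text = in_file.read()

print(repr(text))        # 'Closed automatically.\n'
```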


In [None]:
# create a file and open the file for writing
write_file = open("dummy.txt", "w")
# Write to the file
write_file.write("Not closing files is VERY BAD.")
# if we comment out the next line, the buffered text may not be written to our dummy.txt file
write_file.close()