### GCP environment setup:
Only run the following code block if you are using GCP. If you are running this notebook locally, you can download the files from Canvas. 
* We will quickly discuss the elements of this command and file management in GCP

In [None]:
#!gsutil cp gs://jax-presgraves-edusumner2-courses-code/Intro_to_Python_1_2025/Module_2/*.csv /content/

# Summary
1. Everything in Python is an object
2. We are going to introduce a **file object**
   * allows us to use, access, manipulate files
   * is created with open() and, like all objects, has associated methods: .read(), .write(), .close() and others

# Overview of this module
1. I/O Data
2. Rules of reading, writing, overwriting, appending to files
3. Paths - where are the files we create placed?
4. Creating plain text files
5. USER Input
    * how to validate?  

# I/O

A useful program will usually require data input to manipulate and will spit out the results of this manipulation. We have learned a tiny bit of some manipulations with strings. 

In today's lecture, we will learn how to open, read and write files - sometimes even multiple files at multiple locations. 

We will use "flat files" (i.e., files with no structured relationship between records, unlike a database). This means mostly plain text files (.txt) or comma-separated files (.csv). 

### What happens if we want to read in a file and then write to a different file?  

* Go back to your "Home" tab and you should see that there are no files there.

What does the following program do? 

In [1]:
# The following is called a "List Comprehension" and it is a way of efficiently filling a list from a loop
# we will discuss loops next, this is just a glimpse of ways to build on simple techniques.
# -------------------------------
my_list = [i**2 for i in range(1,11)]
print(my_list)

f = open("output1.txt","w+")

for item in my_list:
    f.write(str(item) + "\n")
f.close()

[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]


In [6]:
my_list2=["AT","CG","cg"]
# what happens if you use this line, where "w" isn't specified? Try it with w, r, a and see which one allows you to create a new file, etc.
#f = open("output_test_1.txt") 

#Does this provide a hint about what "w" is doing?
f= open("output_3.txt","w+")
# ---------------------
# CHECK YOUR FOLDER AND SEE IF THERE IS A NEWLY CREATED FILE CALLED output_3.txt
# ---------------------
for item in my_list2:
    f.write(item + "\n")
#f.close()

## Reading Text from files

We will need to read information from a file such as a FASTA file (https://en.wikipedia.org/wiki/FASTA_format):

\>seq0

FQTWEEFSRAAEKLYLADPMKVRVVLKYRHVDGNLCIKVTDDLVCLVYRTDQAQDVKKIEKF

\>seq1

KYRTWEEFTRAAEKLYQADPMKVRVVLKYRHCDGNLCIKVTDDVVCLLYRTDQAQDVKKIEKFHSQLMRLMELKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM

\>seq2

EEYQTWEEFARAAEKLYLTDPMKVRVVLKYRHCDGNLCMKVTDDAVCLQYKTDQAQDVKKVEKLHGK

Or we want to write information to a file such as the results from a BLAST search.

How do we accomplish such tasks? 

**The _major_ tasks that we want to be able to do when manipulating text files include:**

1. Ensure that the file exists manually *and that it is in the correct path*
2. Open the file <- For now, if we try to open a file that doesn't exist, Python raises an exception: FileNotFoundError: [Errno 2] No such file or directory. We saw this in the cell above when we opened a file without an access mode that allows the file to be created. There are tools for handling this case gracefully, namely the try/except statement, which is structured somewhat like the if/elif/else decision blocks. We won't cover try/except in this course.
3. Read file (or write to the file, or a different file, or read and write to the file)<-method for file object
4. Close file <- method for file object

### Procedure for _reading_ a file: 
1. Check file exists
2. Open file
3. Read file
4. Close file
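
The four reading steps above can be sketched as follows. This is a minimal illustration, not the only way to do it: the file name `reading_demo.txt` is hypothetical and is created in the first lines only so that the cell runs on its own, and the existence check uses `os.path.isfile()` rather than try/except.

```python
import os.path

# Hypothetical file name, created here only so the example can run on its own.
file_name = "reading_demo.txt"
setup = open(file_name, "w")
setup.write("ATTGTCCCGT\n")
setup.close()

# 1. Check that the file exists
if os.path.isfile(file_name):
    # 2. Open the file (read mode)
    my_file = open(file_name, "r")
    # 3. Read the file
    contents = my_file.read()
    # 4. Close the file
    my_file.close()
    print(contents)
else:
    contents = None
    print("Could not find " + file_name)
```

If the check fails, the program prints a message instead of stopping with an error.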

### Procedure for _writing to_ an _existing_ file: 
1. Check file exists
2. Open file
3. _Append_ text to file
4. Close file

### Procedure for _overwriting_ or _creating_ a _NEW_ file: 
1. Open (or create) file
2. Write to file
3. Close file
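
To see the difference between appending and overwriting, here is a small sketch. The file name `modes_demo.txt` and the text written to it are made up for illustration.

```python
# Hypothetical file name used only for this illustration.
file_name = "modes_demo.txt"

# "w" creates the file (or empties it if it already exists)
f = open(file_name, "w")
f.write("first line\n")
f.close()

# "a" keeps the existing text and adds to the end
f = open(file_name, "a")
f.write("second line\n")
f.close()

f = open(file_name, "r")
after_append = f.read()
f.close()

# opening with "w" again wipes the earlier contents
f = open(file_name, "w")
f.write("only line\n")
f.close()

f = open(file_name, "r")
after_overwrite = f.read()
f.close()

print(after_append)
print(after_overwrite)
```

After the append step the file holds both lines; after the second "w" step it holds only the last line written.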



### A file object allows us to access and manipulate data. 
* open(file_address, access_mode) is a built-in function that returns a file object. It takes two arguments: file_address and access_mode
* access mode (sometimes called "file permissions"):
    1. r/r+  - cursor is at the beginning of the file; read only/read and write. The file must already exist
    2. w/w+  - cursor is at the beginning of the file; write only/read and write. The file is created if needed and any existing contents are erased
    3. a/a+  - cursor is at the end of the file (for appending); append only/read and append. The file is created if needed
* other methods (not exhaustive, there are other methods associated with files):
  * file_object_name.read(size); if the size argument is omitted, it reads until EOF
  * file_object_name.readline(size); returns the next line as a string (until newline or EOF is encountered)
  * file_object_name.readlines(); returns a list of the remaining lines
  * file_object_name.write(string); writes a string to a file
  * file_object_name.close(); a file cannot be read from or written to until it is opened again
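
As a rough sketch of where the cursor starts in different modes and what the reading methods return, the cell below uses a hypothetical file `cursor_demo.txt`. The exact number that `.tell()` reports at the end of a file can differ between operating systems, so only the start position in "r" mode is described as 0.

```python
file_name = "cursor_demo.txt"  # hypothetical name for this illustration
f = open(file_name, "w")
f.write("ACGT\nTTAA\n")
f.close()

# "r": the cursor starts at the beginning of the file
f = open(file_name, "r")
start_r = f.tell()
first_line = f.readline()   # the next line, including its "\n"
rest = f.readlines()        # a list of the lines that are left
f.close()

# "a+": the cursor starts at the end of the file
f = open(file_name, "a+")
start_a = f.tell()
f.seek(0)                   # move the cursor back to the beginning
everything = f.read()       # now the whole file can be read
f.close()

print(start_r, start_a)
print(repr(first_line), rest)
print(repr(everything))
```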

In [10]:
# How do we open a file?
#-------------------------------
# the format that we use is:
# variable_name = open(path_if_necessary_and_file_name,mode)
# where mode = append, read, write
# ---------------------------

f = open("output.txt", "w")

#the command above *created* a textfile called 'output.txt' in 'w' mode. However, in the cell above, we 
# created an output.txt file, too. What happened?
# You can check that there is now a file called output.txt
# the new file **object** is stored as the f file object which now has **attributes and methods **
# instead of a filename,"output.txt", you can insert a path name - although we won't worry about that so far
# in this course

f.write("Hello JAX!"+"\n")
f.close()




# How can we incorporate list (comprehensions) into I/O? 
(I'm so glad that you asked!)

In [11]:
my_list = [i**2 for i in range(1,11)]
# recreate the original output.txt file: 
# ---------------------------------
f = open("output.txt","w+")
for item in my_list:
    # convert each number to string since write takes string objects. 
    f.write(str(item) + "\n")
f.close()
# ---------------------------------
# let's use the permission a+. What does this do to our output? Run it and check the output.txt file
# After you have checked the file, change the a+ to r+ and see how that changes the output.txt file

my_file=open("output.txt","a+")
my_file.write("Hello JAX!"+"\n")
# as an example, what do you expect the following to give you? Why?
#print(my_file)
my_file.close()

The subtle differences between file name, file object and file contents:
-----------------------

**File object** <--Represents the file itself. You use methods like .read() on file objects and ONLY on file objects (you will get an AttributeError if you attempt to use the .read() method on something that isn't a file object)

**File name** <--String with the name of the file

**File content** <--String that stores text of the file itself; .read()
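
A short sketch of the three ideas side by side. The file name `name_object_contents.txt` is hypothetical and the file is created first so the cell is self-contained.

```python
file_name = "name_object_contents.txt"  # file name: just a string
f = open(file_name, "w")
f.write("ATGC\n")
f.close()

file_object = open(file_name, "r")      # file object: what open() returns
file_contents = file_object.read()      # file contents: a string of text
file_object.close()

print(type(file_name))
print(type(file_object))
print(type(file_contents))

# only the file object has a .read method
print(hasattr(file_name, "read"))
print(hasattr(file_object, "read"))
```

The file name and the file contents are both strings, which is why it is easy to mix them up; only the file object knows how to read.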

In [20]:
# my suggestion is that if you are using a file, put it in the same directory/folder as the jupyter notebook that
# is accessing it.
# If not, you can always type the full path to the file, it is just a bit more complicated on the first step. 
# Note the file path will look different if you are on a PC (the forward slashes will be replaced with backward slashes etc).
# ------------------------------------
# The file path below is for my LOCAL mac os. REMEMBER THAT THIS PATH WILL BE DIFFERENT ON COLAB! 
my_file_test = "/Users/presgd/MyPythonFiles/dna.csv"
# the same file path for local windows users would look something like this: 
#my_file_test = "C:\\Users\\daniellepresgraves\\mypython\\dna.csv"

print(my_file_test)
print("-----------")
# we are creating the file object my_file_test_pointer
my_file_test_pointer = open(my_file_test,"w+")
print(my_file_test_pointer)
print("~"*15)
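# the unimaginatively named my_file_test_contents holds whatever .read() returns.
# Because "w+" empties the file when it is opened, nothing is printed here.
my_file_test_contents = my_file_test_pointer.read()
print(my_file_test_contents)
print("*******")
# close the file once we are finished with it
my_file_test_pointer.close()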

/Users/presgd/MyPythonFiles/dna1.csv
-----------
<_io.TextIOWrapper name='/Users/presgd/MyPythonFiles/dna1.csv' mode='w+' encoding='UTF-8'>
~~~~~~~~~~~~~~~

*******


In [16]:
#import os
#print(os.getcwd())
#"/Users/presgd/MyPythonFiles/output_4.csv"

/Users/presgd/MyPythonFiles/JAXIntroPython1_PRIMM
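
Since path separators differ between macOS/Linux and Windows, one option is to let Python build the path for you. This is a minimal sketch using `os.getcwd()` and `os.path.join()`; the name "dna.txt" is only an example and the file does not need to exist for the cell to run.

```python
import os

# the folder that Python is currently working in
current_folder = os.getcwd()

# join the folder and a file name with the separator that suits this computer
full_path = os.path.join(current_folder, "dna.txt")

print(full_path)
# the last part of the path is the file name itself
print(os.path.basename(full_path))
```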


In [22]:
# we are going to create the file dna.txt NOT dna.csv. dna.txt will be located in the same folder as your Module2B 
f = open("dna.txt", "w")
#the command above created a textfile called 'dna.txt' in 'w' mode. 
# you should be able to check that there is now a file called dna.txt
# -------------------
# since we used "w" and not "w+" as the access mode, Python raises an io.UnsupportedOperation ("not readable")
# exception if we attempt to read the contents of the file as we do in the next line - currently hashed out. 
# If we changed the w to w+, the read would be allowed (although it returns nothing here, because opening with
# w+ empties the file and nothing has been written yet).
# -------------------
#f.read()
f.write("ATTGTCCCGTu"+"\n")
f.close()
# ---------------------------------------
my_file_name = "dna.txt"
# Note: you can just create a PLAIN TEXT FILE with a random sequence in it or you can create the file using the technique above
print(my_file_name)
#my_file_pointer is the pointer
my_file_pointer = open(my_file_name,"r+")
print(my_file_pointer)
#read my file contents
my_file_contents = my_file_pointer.read()
print(my_file_contents)
my_file_pointer.close()

dna.txt
<_io.TextIOWrapper name='dna.txt' mode='r+' encoding='UTF-8'>
ATTGTCCCGTu



# .rstrip() method

Opening files can sometimes require something _extra_.

For instance, files can contain **hidden** characters such as \n (which means start a new line) which need to be removed. A straightforward way to do this is to use the string method: **_.rstrip("\n")_**

There is also a separate method for removing hidden characters from the left side of the string called **_.lstrip()_**. However, that generally isn't relevant for I/O so I mention it merely for completeness.
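
A quick sketch of what these string methods do to a made-up line of text. `repr()` is used so that the hidden characters become visible when printed; `.strip()` is a related string method that trims both sides.

```python
# a made-up line with two leading spaces and a trailing newline
line = "  ATTGTCCCGT\n"

right_stripped = line.rstrip("\n")  # removes the newline on the right
left_stripped = line.lstrip()       # removes whitespace on the left
both_stripped = line.strip()        # removes whitespace on both sides

print(repr(line))
print(repr(right_stripped))
print(repr(left_stripped))
print(repr(both_stripped))
print(len(line), len(right_stripped))
```

The length drops by one after `.rstrip("\n")` because the newline counts as a character.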

In [24]:
#open the file that we created above. Notice: when we originally created dna.txt, we gave it access 'w', but
# when we open the file again, we can specify different access. Here we have not given any access mode so 
# the default of Python is to assume "r" as access mode
# -----------------------
my_file=open("dna.txt")

#read contents - you could do this in two lines:
# 1. read in files contents
my_file_contents= my_file.read()
# 2. put file contents into variable WITHOUT the .rstrip(), see what happens: 
#my_DNA=my_file_contents

# or 2. put file contents into variable WITH the .rstrip(), see what happens:
my_DNA=my_file_contents.rstrip("\n")

# -----------------------------
#notice that you can BE MORE EFFICIENT AND append the .rstrip method to the read method.
# --------------------------
#my_DNA = my_file.read().rstrip("\n")
#calculate length by using the built-in function len(). We will see this function A LOT. 
dna_length=len(my_DNA)
#print output
print("Sequence is "+my_DNA+" and the length is "+str(dna_length) +" bp")
my_file.close()

Sequence is ATTGTCCCGTu and the length is 11 bp


## Writing to an output file
The steps involved in writing to a file: 

1. Open/create file and specify access (or Python will assume r and you won't be able to write to it)
2. Write to file:  .write()
3. Close file: .close() – ensures that output to a file is flushed

In [36]:
#remember the first example we gave and we asked what it did?
# Here it is again but this time you know what it does!
my_list = [i**2 for i in range(1,11)]

my_file = open("output.txt", "w")

for each in my_list:
    my_file.write(str(each)+"\n")

# After running this cell, you could try running it again with the following line hashed out. 
# When you open output.txt, what happens when the .close() is hashed out? 
my_file.close()

## Reading from a file
1. open file 
2. read file 

   a. read one line at a time from file. The first time it is called, it reads the first line; the second time it is called, it reads the second line, etc. 

            .readline()

   b. read all remaining lines into a list

            .readlines()

   c. read the entire file contents into one string

            .read()
  
3. close file

In [42]:
# .read() with no argument reads all contents of the file at once.
my_file=open("output.txt","r")
# with a size argument, you can print off only the first 5 characters -- including the \n characters
print(my_file.read(5))
my_file.close()
print("------------")
# the difference between the readlines() and readline() methods:
# readlines() returns every line as a list, readline() (hashed out below) returns one line each time it is called.
my_file=open("output.txt","r")
print(my_file.readlines())
#print(my_file.readline().rstrip("\n"))
#print(my_file.readline().rstrip("\n"))
#print(my_file.readline().rstrip("\n"))
my_file.close()

1
4
9
------------
['1\n', '4\n', '9\n', '16\n', '25\n', '36\n', '49\n', '64\n', '81\n', '100\n']


## Aside: Why do we need to *__close__* our files? 
Data is buffered during the I/O process which means it is held in a temporary location before being written to the file. 

Python flushes the buffer (writes the buffer to the file) when the buffer fills up, when you call .flush(), or when the file is closed. If you never close the file, the last of your data may not be transferred to it.
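
A small sketch of buffering in action, using a hypothetical file `buffer_demo.txt`. On a typical Python installation the first read usually comes back empty because the short message is still sitting in the buffer, but the exact behavior can depend on the system, so treat the "before" result as an illustration rather than a guarantee.

```python
file_name = "buffer_demo.txt"  # hypothetical name for this illustration

writer = open(file_name, "w")
writer.write("short message")

# look at the file BEFORE the writer has been closed
reader = open(file_name, "r")
before_close = reader.read()
reader.close()

# closing the writer flushes the buffer to the file
writer.close()

# look at the file AFTER the writer has been closed
reader = open(file_name, "r")
after_close = reader.read()
reader.close()

print("before close:", repr(before_close))
print("after close: ", repr(after_close))
```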

Digression for when you want to use Python in the wild: 
* Python can automatically close your files for you with a context manager, written using the **with** statement. Behind the scenes it relies on two special methods: \_\_enter\_\_() and \_\_exit\_\_()
* We won't discuss them in this course (we might see them again in Python II when object-oriented programming is used) 
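
For the curious, here is a minimal sketch of a context manager in use. The file name `with_demo.txt` is hypothetical. Everything indented under `with` runs while the file is open, and the file is closed automatically when the indented block ends.

```python
# write to a file; no explicit .close() is needed
with open("with_demo.txt", "w") as out_file:
    out_file.write("Closed automatically.\n")

# the file object still exists, but it has been closed for us
print(out_file.closed)

# read the text back the same way
with open("with_demo.txt") as in_file:
    text = in_file.read()
print(text)
```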


In [59]:
# create a file and open the file for writing
write_file = open("dummy.txt", "w")
# Write to the file
write_file.write("Not closing files is VERY BAD.")
# if we hash out the next line, the buffered text may never be written to our dummy.txt file
write_file.close()

# In Class Questions: 
1. (5 minutes) Open the Module2_Example.txt file, remove the first line of the FASTA file and then write the sequence to a different file. (This can be 'pseudocode' for now but it will be useful to think about it for your assignment.)
   
2. (10 minutes) Take the FASTA file below and send each of three sequences to different files named seq0, seq1, seq2

\>seq0

FQTWEEFSRAAEKLYLADPMKVRVVLKYRHVDGNLCIKVTDDLVCLVYRTDQAQDVKKIEKF

\>seq1

KYRTWEEFTRAAEKLYQADPMKVRVVLKYRHCDGNLCIKVTDDVVCLLYRTDQAQDVKKIEKFHSQLMRLMELKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM

\>seq2

EEYQTWEEFARAAEKLYLTDPMKVRVVLKYRHCDGNLCMKVTDDAVCLQYKTDQAQDVKKVEKLHGK

---------------------------------
3. (20 minutes) This question combines what we have learned over the last two days.

For the sequence given below:  **5'-ATCGATCGATCGATCGACTGACTAATCATAGCTATGCATGCTACTCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCATGCTAACATCGATCGATATCGATGCATCGACTAGTACTAT-3'**
 Introns are the noncoding parts of a DNA sequence that are spliced out (removed) from the sequence before it is translated into a chain of amino acids. In this sequence, there are two exons and one intron. The first exon runs from the start of the sequence to the 63rd character. The second exon runs from the 91st character to the end of the sequence. (Hint: in the following sequence of numbers, what is the index of the 3rd character?  1 2 3 4)

	Write a program that will split the genomic DNA into coding and non-coding (intron) parts and write these to *two different files* (a file for exons/coding and a file for the spliced out introns/non-coding). Your program should be flexible enough to work with any sequence provided by a user, but you can assume that the user knows the positions (locations) where the introns/exons are located. 


### Now back to the basics...
## User input
We are not always going to want to use a ‘hard coded’ path to a file; our program will be more flexible if we can ask a user for the input file

This can lead to a problem, for instance, a file that doesn’t exist (perhaps because it has been mistyped)

There are two main steps when seeking user input: 

        * User input
        * User validation

The **input()** function always returns a string. This means that, depending on what the user enters, you may need to convert it to another type, for example with int() or float(). 
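
One simple way to validate input before converting it is to check the string first. The sketch below uses a hypothetical helper function, `to_whole_number()`, and fixed example strings in place of real `input()` calls so that it runs without waiting for a user. It only accepts non-negative whole numbers; negative numbers and decimals are treated as invalid here.

```python
def to_whole_number(text):
    """Return text as an int if it contains only digits, otherwise None."""
    cleaned = text.strip()       # ignore spaces around the answer
    if cleaned.isdecimal():      # True only if every character is a digit
        return int(cleaned)
    return None

# these strings stand in for what a user might type at an input() prompt
print(to_whole_number("6"))
print(to_whole_number(" 3 "))
print(to_whole_number("six"))
```

In a notebook you could call it as `to_whole_number(input("Enter a whole number: "))` and ask again whenever the result is None.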

In [25]:
print("Are you in your first, second, third, fourth or more year of university?: ")
year = input()
print("What is your major?")
major = input()
print(major.upper())
print("What is your favorite class so far?")
fav = input()
print("To summarize, you're in your %r year of university, are pursuing a %r major and you love %r"%(year, major, fav))

#to demonstrate the sometimes necessary conversion to integers: 
print("Now onto inputting integers")
A = input()
B = input()
print(A+B)
# this means that input() brings in user provided variables/info as a string
print("We didn't mean to concatenate them, did we?")
total=int(A)+int(B)
print("Your total of the two numbers is: "+str(total))

Are you in your first, second, third, fourth or more year of university?: 


 infinity


What is your major?


 dogs


DOGS
What is your favorite class so far?


 canine studies


To summarize, you're in your 'infinity' year of university, are pursuing a 'dogs' major and you love 'canine studies'
Now onto inputting integers


 6
 3


63
We didn't mean to concatenate them, did we?
Your total of the two numbers is: 9


In [32]:
# sophisticated example -- What does it do? 
import os.path
# I/O - an ever-so-slightly more sophisticated example
accession = input("Enter the accession name:")
print("Here is the name: " + accession)

# you can enter a filename by input. We need to wait for a response. This should be a file
# that exists so let's use dna.txt or dummy.txt since they are both in our pathway
filename = input("Enter filename: ")

# open the filename that the user inputted - input() returns a string, so you don't need quotes around
# the variable filename. open() creates a file object, which we name txt below.
# os.path.isfile() lets you check whether the file your user inputted actually exists:
# if it does, it returns True
print(os.path.isfile(filename))

txt = open(filename, "a+")
# see what happens when you don't add the "a+" permission by unhashing out the next line .....
#txt=open(filename,"w+")
print("Here is your file name: %r" % filename)
# txt is a file object, print it out to remind yourself of the difference between file name and file object
print("Here is the pointer to your file object: " + str(txt))
# read the text file object, txt, with the .read() method and print it to the screen
file_contents = txt.read()

# There is something strange that happens when you are in a+ permissions - the pointer position is after 
# the last character that is already in the file - so you can append to it instead of overwriting it! This means
# that if we ask it to print the file contents, it will print nothing since the pointer is at the current end.
# We can check the pointer position with the .tell() method
print("We can see that we are at the end of the text file. We are here: "+str(txt.tell()))
# we can then reposition the pointer to be at 0 if we wanted to do so - obviously this is a little dangerous
# and for the most part unnecessary. the Method is the .seek() method. 
print("Here are the file contents: ")
print("---------------------------")
# this should print nothing since we are at the end of our text file - the pointer is at the end of the text file -
# and we haven't added anything to it yet. 
print(file_contents)
print("*" * 10)
txt.write("AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA_______")
# if you open up the dummy.txt file you should now see that the A's have been appended
txt.close()
# you need to close to flush the file

Enter the accession name: klj


Here is the name: klj


Enter filename:  lkj


False
Here is your file name: 'lkj'
Here is the pointer to your file object: <_io.TextIOWrapper name='lkj' mode='a+' encoding='UTF-8'>
We can see that we are at the end of the text file. We are here: 0
Here are the file contents: 
---------------------------

**********
