###  Summary of previous notebook 
1. Data types and their associated methods
    * format: variable_name.method()
    * special tricks such as variable_name.[tab key] will bring up method options for data type of variable_name
    
2. Special manipulations with strings, e.g. slicing and escape characters

Remember that there will be lots of commands that are hashed out in the cells. I have usually done this to demonstrate that the command will result in an error or, occasionally, to show an alternative way to do something. You can unhash these lines as you follow along with the logic being shown. 

# Lecture 3 - I/O

A useful program will usually require data input to manipulate and will spit out the results of this manipulation. We have learned a tiny bit of some manipulations with strings. 

In today's lecture, we will learn how to open, read and write files - sometimes even multiple files at multiple locations. 

We will use "flat files" (i.e. files with no structured relationships) for this course until we are introduced to the SQLite format. This means mostly text files (.txt) or comma-separated files (.csv). 

### What happens if we want to read in a file and then write to a different file?  

What does the following program do? 

In [3]:
# we will look at what this list comprehension means in a few lectures (and module 3) but,
# for now, all you need to know is that we are creating a list. You can look back on this
# cell in a few weeks and think, 'oh, yeah. That was easy'. But you probably don't think 
# that right at this second...unless you already happen to know a programming language. 

my_list = [i**2 for i in range(1,11)]
print(my_list)
#my_list=["AT","CG","cg"]
# what happens if you use this line, where "w" isn't specified? 
#f = open("output_test_1.txt") 

#Does this provide a hint about what "w" is doing?
# ---------------------
# CHECK YOUR FOLDER AND SEE IF THERE IS A NEWLY CREATED FILE CALLED output_test_1.txt
# ---------------------

f = open("output.txt","w+")

# If you check, you should find a file called 'output.txt' in the same folder as this
# notebook. We haven't learned about loops yet, but hopefully you can figure out what
# the loop below does from the code... of course, you can also just open 'output.txt'
# and look.

for item in my_list:
    f.write(str(item) + "\n")
f.close()

[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]


## Reading Text from files

We will need to read information from a file such as a FASTA file (https://en.wikipedia.org/wiki/FASTA_format):

\>seq0

FQTWEEFSRAAEKLYLADPMKVRVVLKYRHVDGNLCIKVTDDLVCLVYRTDQAQDVKKIEKF

\>seq1

KYRTWEEFTRAAEKLYQADPMKVRVVLKYRHCDGNLCIKVTDDVVCLLYRTDQAQDVKKIEKFHSQLMRLMELKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM

\>seq2

EEYQTWEEFARAAEKLYLTDPMKVRVVLKYRHCDGNLCMKVTDDAVCLQYKTDQAQDVKKVEKLHGK

Or we want to write information to a file such as the results from a BLAST search.
How do we accomplish such tasks? 

**The _major_ tasks that we want to be able to do when manipulating text files include:**

1. Ensure that the file exists manually *and that it is in the correct path*
2. Open the file <- Right now, we rely on Python raising an exception when we try to open a file that doesn’t exist: in Python 3 this is a FileNotFoundError, reported as [Errno 2] No such file or directory. We’ll learn about this later, but there are tools for handling a missing file gracefully, such as try/except.
3. Read file (or write to the file, or a different file, or read and write to the file)<-method for file object
4. Close file <- method for file object
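
As a rough preview of the try/except idea mentioned in step 2, here is a minimal sketch. The helper name `read_text_or_none` and the file names are made up for illustration, and the demonstration runs in a temporary folder so that none of your own files are touched.

```python
import os
import tempfile

def read_text_or_none(path):
    """Hypothetical helper: return the file's text, or None if the file is missing."""
    try:
        f = open(path)             # default mode is "r"
    except FileNotFoundError:      # what Python 3 raises for [Errno 2]
        return None
    contents = f.read()
    f.close()
    return contents

# Demonstration in a temporary folder (the file names are assumptions)
demo_dir = tempfile.mkdtemp()
missing_path = os.path.join(demo_dir, "no_such_file.txt")
existing_path = os.path.join(demo_dir, "seq.txt")

# create a small file so that one of the two paths really exists
f = open(existing_path, "w")
f.write("ATGC\n")
f.close()

print(read_text_or_none(missing_path))         # the missing file gives None
print(repr(read_text_or_none(existing_path)))  # the existing file gives its text
```

The point of the sketch is only that a missing file does not have to crash the program; we will come back to try/except properly later.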

### Procedure for _reading_ a file: 
1. Check file exists
2. Open file
3. Read file
4. Close file

### Procedure for _writing to_ an _existing_ file: 
1. Check file exists
2. Open file
3. _Append_ text to file
4. Close file

### Procedure for _writing to_ or _creating_ a _NEW_ file: 
1. Open (really, create) file
2. Write to file
3. Close file
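
The three procedures above can be sketched in one short cell. This is only an illustration under assumptions: the file name `procedure_demo.txt` is invented, the file lives in a temporary folder, and `os.path.isfile()` is used for the "check file exists" step.

```python
import os
import tempfile

# an invented file in a temporary folder, so nothing of yours is overwritten
demo_path = os.path.join(tempfile.mkdtemp(), "procedure_demo.txt")

# Creating a NEW file: open (really, create), write, close
f = open(demo_path, "w")
f.write("first line\n")
f.close()

# Writing to an EXISTING file: check, open, append, close
if os.path.isfile(demo_path):
    f = open(demo_path, "a")
    f.write("second line\n")
    f.close()

# Reading a file: check, open, read, close
if os.path.isfile(demo_path):
    f = open(demo_path, "r")
    demo_contents = f.read()
    f.close()

print(demo_contents)
```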


In [75]:
#How do we open a file?
#-------------------------------
# the format that we use is:
# variable_name = open(path_if_necessary_and_file_name,mode)
# where mode = append, read, write (see next cell for text explanation)
# ---------------------------

f = open("output.txt", "w+")

#the command above creates a text file called 'output.txt' in 'w+' mode. 
#However, in the cell above, we created an output.txt file, too. What happened?
# you should be able to check that there is now a file called output.txt
# the new file **object** is stored as the f file object which now 
# has **attributes and methods **
# instead of a filename,"output.txt", you can insert a path name - although we
# won't worry about that so far in this course

f.write("Hello World"+"\n")
f.close()

In [76]:
# if you created the same file but with 'r+' mode, the output.txt file would also be 
# different: 
my_list = [i**2 for i in range(1,11)]
# recreate the original output.txt file: 
# ---------------------------------
f = open("output.txt","w")
for item in my_list:
    f.write(str(item) + "\n")
f.close()
# ---------------------------------
# the following should ONLY overwrite the first 12 characters ("Hello World" plus its '\n').
# In output.txt those 12 characters are the lines 1, 4, 9, 16 and 25, including the '\n'
# that was added after each number so that it appeared on its own line

my_file=open("output.txt","r+")
my_file.write("Hello World"+"\n")
# as an example, what do you expect the following to give you? Why?#
print(my_file)
my_file.close()

<_io.TextIOWrapper name='output.txt' mode='r+' encoding='UTF-8'>


## Permissions for the open method
r – default mode; the cursor is placed at the beginning of the file and you can only read text (not write to the file)

r+ – read/write, but doesn’t wipe the file first (to see what this means, replace the "w+" with "r+" and see what happens to the output file that was created in cell 1; there will be a difference). The cursor is placed at the beginning of the file, so whatever you write replaces existing characters from the start, but the rest of the file is left alone, unlike with w or w+

w – opens a file for writing only; overwrites the file if it exists and creates a new file if it doesn’t

w+ – opens a file for both reading and writing; overwrites the file if it exists and creates a new file if it doesn’t exist <- as you can guess, this is a little dangerous so use with caution

a – opens a file for appending; the cursor is at the end of the file if it exists, and at the beginning if a new file has to be created

a+ – opens a file for appending and reading
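
To make the differences between the modes concrete, here is a small sketch. The file `modes_demo.txt` and the helper `show()` are invented for this illustration, and the file is created in a temporary folder.

```python
import os
import tempfile

mode_path = os.path.join(tempfile.mkdtemp(), "modes_demo.txt")

def show(path):
    """Hypothetical helper: return the current text of the file."""
    f = open(path, "r")
    text = f.read()
    f.close()
    return text

# "w" creates the file (or wipes it if it already exists)
f = open(mode_path, "w")
f.write("0123456789\n")
f.close()
after_w = show(mode_path)

# "r+" starts at the beginning and replaces only the characters it writes over
f = open(mode_path, "r+")
f.write("AB")
f.close()
after_r_plus = show(mode_path)

# "a" starts at the end, so existing text is kept
f = open(mode_path, "a")
f.write("end\n")
f.close()
after_a = show(mode_path)

# opening with "w" again empties the file, even if nothing is written
f = open(mode_path, "w")
f.close()
after_second_w = show(mode_path)

print(repr(after_w))         # '0123456789\n'
print(repr(after_r_plus))    # 'AB23456789\n'
print(repr(after_a))         # 'AB23456789\nend\n'
print(repr(after_second_w))  # ''
```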

### Pointers: 

https://xkcd.com/138/

### The subtle differences between file name, file object and file contents
* File object <--Represents the file itself. You use methods like .read on file objects and ONLY on file objects (you will get an AttributeError if you attempt to use the .read method on something that isn't a file object, such as the file name string)

* File name<--String with the name of the file

* File content<--String that stores the text of the file itself; this is what .read() returns

In [78]:
# my suggestion is that if you are using a file, put it in the same directory/folder as the 
# jupyter notebook that is accessing it. If not, you can always type the full path to the file; it is 
# just a bit more complicated on the first step. 
# Note the file path will look different if 
# you are on a PC.
#The file path below is for my mac os. 
my_file_test = "/Users/daniellepresgraves/mypython/dna.csv"
# the same file path for windows should look something like this: 
# ---------------------------------
# Note: most things in Anaconda should be the same regardless of operating system but
# I am working on a MAC so you might find that windows has a different path system than
# what I have written here. I can't test this path because I am not on Windows. 
# ---------------------------------
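# (the r prefix makes this a raw string so the backslashes are not treated as escape
# characters; the C: drive letter is an assumption)
#my_file_test = r"C:\Users\daniellepresgraves\mypython\dna.csv"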

print(my_file_test)
print("-----------")
# we are creating the file object my_file_test_pointer
my_file_test_pointer = open(my_file_test)
print(my_file_test_pointer)
print("~"*15)
# the unimaginably named my_file_test_contents should regurgitate what is 
# actually in your file
my_file_test_contents = my_file_test_pointer.read()
print(my_file_test_contents)
# close the first file now that we have read it
my_file_test_pointer.close()
print("*******")
# the following is dna.txt NOT dna.csv and dna.txt should be located in the same folder
# as your lecture_2B notebook if you want to avoid having to put in the full path
my_file_name = "dna.txt"
#Note: you can just create a PLAIN TEXT FILE with a random sequence in it. 
print(my_file_name)
#my_file_pointer is the pointer
my_file_pointer = open(my_file_name)
print(my_file_pointer)
#read my file contents
my_file_contents = my_file_pointer.read()
print(my_file_contents)
my_file_pointer.close()

/Users/daniellepresgraves/mypython/dna.csv
-----------
<_io.TextIOWrapper name='/Users/daniellepresgraves/mypython/dna.csv' mode='r' encoding='UTF-8'>
~~~~~~~~~~~~~~~
ATGCGCGTAGAGCTTTTTTTGGGGGGGAAAA
*******
dna.txt
<_io.TextIOWrapper name='dna.txt' mode='r' encoding='UTF-8'>
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA



## .rstrip() method

Opening files can sometimes require something _extra_.

For instance, files can contain **hidden** characters such as \n (the newline character, which starts a new line) that need to be removed. A straightforward way to do this is to use the string method **_.rstrip("\n")_**, which strips any newline characters from the right-hand end of the string.

There is also a separate method for removing hidden characters from the left side of the string called **_.lstrip()_**. However, that generally isn't relevant for I/O so I mention it merely for completeness.
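
A quick sketch of how the strip methods behave on a made-up string (the variable names are invented for this illustration). `repr()` is used so that the hidden characters are visible when printed.

```python
# a made-up line with a hidden tab on the left and a hidden newline on the right
line = "\tATGC\n"

right_stripped = line.rstrip("\n")   # removes the newline on the right only
left_stripped = line.lstrip()        # removes whitespace (here, the tab) on the left only
both_stripped = line.strip()         # removes whitespace on both sides
many_newlines = "ATGC\n\n".rstrip("\n")  # every trailing newline is removed, not just one

print(repr(right_stripped))  # '\tATGC'
print(repr(left_stripped))   # 'ATGC\n'
print(repr(both_stripped))   # 'ATGC'
print(repr(many_newlines))   # 'ATGC'
```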

In [80]:
#open the file
my_file=open("dna.txt")

#read contents - you could do this in two lines:
# 1. read in files contents
my_file_contents= my_file.read()
# 2. put file contents into variable WITHOUT the .rstrip(), see what happens: 
#my_DNA=my_file_contents

# or 2. put file contents into variable WITH the .rstrip(), see what happens:
my_DNA=my_file_contents.rstrip("\n")

# -----------------------------
#notice that you can BE MORE EFFICIENT AND append the .rstrip method to the read method.
# --------------------------
#my_DNA = my_file.read().rstrip("\n")
#calculate length by using the built-in function len(). We will see this function A LOT. 
dna_length=len(my_DNA)
#print output
print("Sequence is "+my_DNA+" and the length is "+str(dna_length) +" bp")
my_file.close()

Sequence is AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA and the length is 31 bp


## Writing to an output file
The steps involved in writing to a file: 

1. Open/create file with permissions
2. Write to file:  .write()
3. Close file: .close() – ensures that output to a file is flushed


In [81]:
#remember the first example we gave and we asked what it did?
# Here it is again but this time you know what it does!
my_list = [i**2 for i in range(1,11)]

my_file = open("output.txt", "w")

for each in my_list:
    my_file.write(str(each)+"\n")

# After running this cell, you could try running it again with the following line
#hashed out. When you open output.txt, what happens when the .close() is hashed out? 
my_file.close()

## Reading from an output file
1. open file 
2. read file 
    * read one line at a time with .readline(): the first time it is called it reads the first line, the second time it is called it reads the second line, etc.
    * read the entire file contents as one string with .read()
    * read the entire file contents as a list of lines with .readlines()
3. close file
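
The reading methods listed above can be compared side by side. This is a sketch on an invented three-line file (`read_demo.txt` in a temporary folder), not on the lecture's output.txt.

```python
import os
import tempfile

read_path = os.path.join(tempfile.mkdtemp(), "read_demo.txt")

# build a small file with three lines
f = open(read_path, "w")
f.write("1\n4\n9\n")
f.close()

# .read() returns everything as one string
f = open(read_path)
whole = f.read()
f.close()

# .readline() returns one line per call, moving forward each time
f = open(read_path)
first = f.readline()
second = f.readline()
f.close()

# .readlines() returns a list with one string per line
f = open(read_path)
lines = f.readlines()
f.close()

print(repr(whole))   # '1\n4\n9\n'
print(repr(first))   # '1\n'
print(repr(second))  # '4\n'
print(lines)         # ['1\n', '4\n', '9\n']
```

Notice that each line keeps its trailing "\n", which is why .rstrip("\n") shows up so often when reading files.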

In [83]:
# .read() with no argument reads all of the file contents at once
my_file=open("output.txt", "r")
# here we print only the first 5 characters -- including the \n characters
print(my_file.read(5))
my_file.close()

1
4
9


In [84]:
# the difference between the readlines() and readline() methods:
# readline() reads the file contents one line at a time
my_file=open("output.txt","r")
# unhash the next line to get every line at once as a list instead
#print(my_file.readlines())
print(my_file.readline().rstrip("\n"))
print(my_file.readline().rstrip("\n"))
#print(my_file.readline().rstrip("\n"))
my_file.close()

1
4


Discussion Thread Exercise on Bb: 
________________________

(Don't need to do the first two parts of this, the sequence has been included along with your lecture notes in module 2)
* go to NCBI and conduct a BLAST search (https://blast.ncbi.nlm.nih.gov/Blast.cgi) for U81861 (this includes two proteins for flagella in Salmonella) 
* download the aligned DNA sequence in FASTA format (the complete sequence is really long) and save the FASTA file in the same path as this lecture (so it can access it).
(This is where you should start):
* open the file, remove the first line of the FASTA file and then write the sequence to a different file.(This can be 'pseudocode' for now but it will be useful to think about it for your assignment.)  

## Why do we need to *__close__* our files? 
Data is buffered during the I/O process which means it is held in a temporary location before being written to the file. 

Python doesn’t necessarily flush the buffer (write the buffer to the file) until the buffer fills up, you call .flush(), or you close the file, so you must close the file to be sure that all of your data has been transferred to it.
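
Here is a minimal sketch of the buffering idea, using an invented file `buffer_demo.txt` in a temporary folder. It uses the file method .flush(), which pushes the buffered text to the file without closing it; closing the file does the same thing as its last step.

```python
import os
import tempfile

buffer_path = os.path.join(tempfile.mkdtemp(), "buffer_demo.txt")

writer = open(buffer_path, "w")
writer.write("buffered text")   # at this point the text may still be sitting in the buffer
writer.flush()                  # force the buffer to be written to the file

# a second file object can now see the text, even though writer is still open
reader = open(buffer_path, "r")
seen_after_flush = reader.read()
reader.close()

writer.close()                  # close() also flushes, then releases the file
print(seen_after_flush)
```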

Digression for when you want to use Python in the wild: Python can automatically close your files for you by using two special methods, \__enter\__() and \__exit\__(), which are what the **with** statement relies on. 


In [85]:
#Open the file for reading (dummy.txt must already exist in this folder)
read_file = open("dummy.txt", "r")

# Use a second file handle to open the file for writing
write_file = open("dummy.txt", "w")
# Write to the file
write_file.write("Not closing files is VERY BAD.")
# if we hash out the next line, the buffered text won't be written into our dummy.txt file
write_file.close()

# Try to read from the file
print(read_file.read())
read_file.close()

Not closing files is VERY BAD.


In [86]:
# what happens when we use the same example as above but don't close our file? 
#Open the file for reading
read_file = open("dummy.txt", "r")

# Use a second file handle to open the file for writing
write_file = open("dummy.txt", "w")
# Write to the file
write_file.write("Not closing files is VERY BAD.")
#*******
#write_file.close()
#*******
# Try to read from the file but it doesn't read - why? 
print(read_file.read())
read_file.close()




### A bit of a digression but you may need it at some point:
## \__enter\__ and \__exit\__ methods
* These methods are what make a class a context manager <- we will learn what classes are in OOP in Lecture 14
* We don't want to focus too much on these methods right now, since that would lead us into a digression about OOP before we are ready
* However, you might encounter these ideas as you are out in the world so I have included this just in case.
* briefly, context managers allow setup and cleanup behaviours on objects when created with a __with__ statement
* Get rid of a lot of the opening and closing boilerplate by using these methods: \__exit\__() automatically closes the file
* We can invoke \__exit\__() by using the **with** and **as** keywords

Example syntax:

    with open("text.txt", "w") as textfile:
        textfile.write("Success!")


In [70]:
class File():
    def __init__(self, file_name, method):
        self.file_obj = open(file_name, method)
    def __enter__(self):
        return self.file_obj
    def __exit__(self, type, value, traceback):
        self.file_obj.close()
#Just by defining __enter__ and __exit__ methods we can use it in a with statement. Let’s try:

with File('demo.txt', 'w') as opened_file:
    opened_file.write('Hola!')
#this should create and write Hola to a file called demo.txt in the same path as your 
# Lecture_2B notebook
#Our __exit__ function accepts three arguments. They are required by every
# __exit__ method which is a
#part of a Context Manager class. Let’s talk about what happens under-the-hood.

#The with statement stores the __exit__ method of File class. 
# It calls the __enter__ method of File class.
#__enter__ method opens the file and returns it.
# The opened file handle is passed to opened_file. We write to
# the file using .write()

#with statement calls the stored __exit__ method.

#the __exit__ method closes the file.
   

## Need to test whether file is open or closed
Python file objects have a closed attribute which is True when the file is closed and False when it is open

Example syntax: 

    File_object.closed
    
This will be more useful when we discuss logic/Boolean


In [72]:
with open("text.txt", "w") as my_file:
    my_file.write("Fisher's 1918 paper is hard. ")
# the with statement has already closed the file, so this check finds nothing to do
if not my_file.closed:
    my_file.close()
print(my_file.closed)

True


### Now back to the basics...
## User input
We are not always going to want to use a ‘hard coded’ path to a file; our program will be more flexible if we can ask a user for the input file

This can lead to a problem, for instance, a file that doesn’t exist (perhaps because it has been mistyped)

There are two main steps when seeking user input: 

* User input
* User validation

The **input()** function returns a string. This means that, depending on what you are trying to enter, you may need
to convert it to an integer with int(). 
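
The validation step can be sketched as follows. The helper `parse_whole_number` is invented for this illustration, and it uses try/except, which we have not covered yet, so treat it as a preview rather than something you need to write now.

```python
def parse_whole_number(text):
    """Hypothetical helper: return text as an int, or None if it is not a whole number."""
    try:
        return int(text.strip())
    except ValueError:          # raised by int() for input such as "seven" or "1.5"
        return None

print(parse_whole_number("12"))     # 12
print(parse_whole_number(" 7 "))    # 7
print(parse_whole_number("seven"))  # None

# In the notebook you could pass the user's answer straight in:
# answer = parse_whole_number(input("Enter a whole number: "))
```

Returning None instead of crashing lets the rest of the program decide what to do with a bad answer, for example by asking again.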

In [10]:
print("Are you in your first, second, third, fourth or more year of university?: ")
year = input()
print("What is your major?")
major = input()
print(major.upper())
print("What is your favorite class so far?")
fav = input()
print("To summarize, you're in your %r year of university, are pursuing a %r major and you love %r"%(year, major, fav))

#to demonstrate the sometimes necessary conversion to integers: 
print("Now onto inputting integers")
A = input()
B = input()
print(A+B)
# this means that input() brings in user provided variables/info as a string
print("We didn't mean to concatenate them, did we?")
total=int(A)+int(B)
print("Your total of the two numbers is: "+str(total))

Are you in your first, second, third, fourth or more year of university?: 
infinite
What is your major?
bio and math
BIO AND MATH
What is your favorite class so far?
this one
To summarize, you're in your 'infinite' year of university, are pursuing a 'bio and math' major and you love 'this one'
Now onto inputting integers
1
2
12
We didn't mean to concatenate them, did we?
Your total of the two numbers is: 3


In [73]:
import os.path
# I/O - an ever-so-slightly more sophisticated example
accession = input("Enter the accession name:")
print("Here is the name: " + accession)

# you can enter a filename by input. We need to wait for a response. This should be a file
# that exists so let's use dna.txt or dummy.txt since they are both in our 
# lecture_2B pathway
filename = input("Enter filename: ")

# input() returns a string, so you don't need to add quotes around the filename.
# Before opening it, os.path.isfile() lets you check whether the file your user
# entered actually exists - if it does, it returns True
print(os.path.isfile(filename))

txt = open(filename, "a+")
# open() creates a file object named txt. See what happens when you use "w+" instead
# of the "a+" permission by unhashing the next line .....
#txt=open(filename,"w+")
print("Here is your file name: %r" % filename)
# txt is a file object, print it out to remind yourself of the difference between file name and file object
print("Here is the pointer to your file object: " + str(txt))
# read the text file object, txt, with the .read() method and print it to the screen
file_contents = txt.read()

# There is something strange that happens when you are in a+ permissions - the pointer position is after 
# the last character that is already in the file - so you can append to it instead of overwriting it! This means
# that if we ask it to print the file contents, it will print nothing since the pointer is at the current end.
# We can check the pointer position with the .tell() method
print("We can see that we are at the end of the text file. We are here: "+str(txt.tell()))
# we can then reposition the pointer to be at 0 if we wanted to do so - obviously this is a little dangerous
# and for the most part unnecessary. the Method is the .seek() method. 
print("Here are the file contents: ")
print("---------------------------")
# this should print nothing since we are at the end of our text file - the pointer is at the end of the text file -
# and we haven't added anything to it yet. 
print(file_contents)
print("*" * 10)
txt.write("AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA_______")
# if you open up the dummy.txt file you should now see that the A's have been appended
txt.close()
# you need to close to flush the file

Enter the accession name:adkjh9
Here is the name: adkjh9
Enter filename: dna.js
False
Here is your file name: 'dna.js'
Here is the pointer to your file object: <_io.TextIOWrapper name='dna.js' mode='a+' encoding='UTF-8'>
We can see that we are at the end of the text file. We are here: 0
Here are the file contents: 
---------------------------

**********
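
The .tell() and .seek() behaviour described in the cell above can be sketched on a file that already has some content. The file `append_demo.txt` is invented and lives in a temporary folder; the variable names are also made up for this illustration.

```python
import os
import tempfile

append_path = os.path.join(tempfile.mkdtemp(), "append_demo.txt")

# start with a file that already contains four characters
f = open(append_path, "w")
f.write("AAAA")
f.close()

txt = open(append_path, "a+")
position_at_open = txt.tell()   # the pointer starts at the end of the existing text
read_at_end = txt.read()        # nothing is left to read from the end

txt.seek(0)                     # move the pointer back to the beginning
read_from_start = txt.read()    # now the existing text can be read

txt.write("CCCC")               # in "a+" mode, writes go to the end of the file
txt.close()

f = open(append_path)
final_contents = f.read()
f.close()

print(position_at_open)
print(repr(read_at_end))
print(repr(read_from_start))
print(repr(final_contents))
```

Moving the pointer with .seek(0) only affects where reading starts; in append mode the new text still lands at the end, so nothing is overwritten.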
