# I/O, storing and reading data

In [None]:
### 10.7.2021 - Fall 2021 PHZ3150

In [None]:
import numpy as np

#### Input/Output (I/O) is usually the slowest part of a code. Be careful how many I/O operations you do.


### for example, compare the time it takes to read this:

In [1]:
cf = np.zeros(10663)

In [2]:
%%time

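with open('10k_file.dat', 'r') as f:
    for j, line in enumerate(f):     # j counts the lines as we loop over the file
        p = line.split()             # split the line into its columns
        cf[ j ] = float( p[4] )      # store the 5th column as a number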
        


Wall time: 26 ms


### versus: 

In [3]:
%%time
f = open('10k_file.dat','r')  
my_text = f.read()               
f.close()                        

Wall time: 6 ms
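The two cells above do not do quite the same work: the first parses every line, while the second only pulls the raw text into memory. A minimal sketch of a like-for-like comparison follows, assuming a made-up file generated on the fly (the name `demo_10k.dat` and its contents are hypothetical, not the course file). Timings depend on the machine and disk, so no numbers are quoted here.

```python
import os
import tempfile
import time

import numpy as np

# Build a hypothetical 10,000-line, 5-column data file in a temporary folder
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'demo_10k.dat')
np.savetxt(path, np.random.rand(10000, 5))

# Option 1: loop over the file and parse the 5th column line by line
start = time.perf_counter()
col_loop = np.zeros(10000)
with open(path, 'r') as f:
    for j, line in enumerate(f):
        col_loop[j] = float(line.split()[4])
t_loop = time.perf_counter() - start

# Option 2: read everything in one go, then parse the text in memory
start = time.perf_counter()
with open(path, 'r') as f:
    text = f.read()
col_bulk = np.array([float(row.split()[4]) for row in text.splitlines()])
t_bulk = time.perf_counter() - start

print(f'line by line: {t_loop:.4f} s, one read: {t_bulk:.4f} s')
```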


#### You can open a file to:

read: f = open('my_file', 'r')

write: f = open('my_file', 'w')

append: f = open('my_file', 'a')

read/write: f = open('my_file', 'r+')
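As a quick illustration of the four modes listed above, here is a minimal sketch that uses each of them once on a throwaway file. The file name `modes_demo.txt` and the temporary folder are assumptions made for this example only.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'modes_demo.txt')  # hypothetical file

with open(path, 'w') as f:      # 'w' creates the file (or empties an existing one)
    f.write('first line\n')

with open(path, 'a') as f:      # 'a' keeps the content and writes at the end
    f.write('second line\n')

with open(path, 'r+') as f:     # 'r+' reads and writes; the file must already exist
    before = f.read()           # after read() the position is at the end of the file
    f.write('third line\n')     # so this write lands after the existing text

with open(path, 'r') as f:      # 'r' is read-only
    after = f.read()

print(after)
```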



#### -----------------------

#### After you open a file you can do things with the data you read in:


#### read everything in one go: f.read()


#### read line by line: f.readline()

#### Don’t forget to eventually close your file: f.close()


#### Note that after closing the file you can no longer read from or write to it through f: any attempt to do so raises an error (ValueError: I/O operation on closed file).


#### Better practice for working with a file:

with open('my_file') as f: <br>
------ do things here

(the file is closed automatically once the indented block ends)
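A small sketch of what the `with` pattern does for you, assuming a throwaway file (`with_demo.txt` is a hypothetical name). It checks the `closed` attribute of the file object inside and outside the block, and shows the error you get if you use the file afterwards.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'with_demo.txt')  # hypothetical file
with open(path, 'w') as f:
    f.write('hello\n')

with open(path) as f:            # the default mode is 'r'
    first_line = f.readline()
    open_inside = not f.closed   # the file is still open inside the block

closed_after = f.closed          # True: the with block closed it for us

try:
    f.readline()                 # f still exists, but the file behind it is closed
except ValueError as err:
    message = str(err)

print(closed_after, message)
```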


#### Let's try it out:

#### Let's start by reading in a random text file, all in one go:

In [4]:
f = open('random_text.txt','r')  # open the file in read ('r') mode under name 'f'
my_text = f.read()               # read all of f ( f.read() ) in variable my_text
f.close()                        # close the file f

In [5]:
#let's see what we read:

print(my_text)

Random selection of text taken from the thinkpython2 book:
The innermost statements, fd and lt are indented twice to show that
they are inside the for loop, which is inside the function definition.
The next line, square(bob), is flush with
the left margin, which indicates the end of both the for loop and the function
definition. Inside the function, t refers to the same turtle bob, so t.lt(90)
has the same effect as bob.lt(90). In that case, why not call the parameter bob?
The idea is that t can be any turtle, not just bob, so you could create a
second turtle and pass it as an argument to square:
alice = turtle.Turtle()
square(alice)
Wrapping a piece of code up in a function is called encapsulation. One of the benefits of
encapsulation is that it attaches a name to the code, which serves as a kind of documentation.
Another advantage is that if you re-use the code, it is more concise to call a function
twice than to copy and paste the body!



#### it worked!

#### Now let's try to use the readline():

In [6]:
f = open('random_text.txt','r')   # open the file in read ('r') mode under name 'f'
 
print( f.readline() )             # read a line of f ( f. readline() ) and print it
print( f.readline() )             # read another line of f

f.close()                         # close the file f

Random selection of text taken from the thinkpython2 book:

The innermost statements, fd and lt are indented twice to show that



In [7]:
f.readline()

ValueError: I/O operation on closed file.

#### readline() reads the file one line at a time, and once the file is closed any further read raises a ValueError.

#### now let's try to open it with the *with open() as f* and read it line by line:

In [8]:
with open("random_text.txt", "r") as f:   # we again open the file as f
    for line in f:                        # we now loop f line by line
        print( line )                       # and we print the line we just read

Random selection of text taken from the thinkpython2 book:

The innermost statements, fd and lt are indented twice to show that

they are inside the for loop, which is inside the function definition.

The next line, square(bob), is flush with

the left margin, which indicates the end of both the for loop and the function

definition. Inside the function, t refers to the same turtle bob, so t.lt(90)

has the same effect as bob.lt(90). In that case, why not call the parameter bob?

The idea is that t can be any turtle, not just bob, so you could create a

second turtle and pass it as an argument to square:

alice = turtle.Turtle()

square(alice)

Wrapping a piece of code up in a function is called encapsulation. One of the benefits of

encapsulation is that it attaches a name to the code, which serves as a kind of documentation.

Another advantage is that if you re-use the code, it is more concise to call a function

twice than to copy and paste the body!



#### What if we want to split the lines into their elements to use them for some reason?

In [9]:
with open("random_text.txt", "r") as f:  # open file as f
    for line in f:                       # start looping the file line by line 
        q = line.split()                 # split the line in its parts ( split() - delimiter: whitespace )
        print ( q )                      # print it

        if q[0] == 'they':               # if the first word in the line is 'they' :
            break                        # break out of the loop


['Random', 'selection', 'of', 'text', 'taken', 'from', 'the', 'thinkpython2', 'book:']
['The', 'innermost', 'statements,', 'fd', 'and', 'lt', 'are', 'indented', 'twice', 'to', 'show', 'that']
['they', 'are', 'inside', 'the', 'for', 'loop,', 'which', 'is', 'inside', 'the', 'function', 'definition.']


In [10]:
# what happens if I ask it to print:
print( q[ 0 ], q[ 3 ] )
# and why?

they the
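One way to see why: `q` is an ordinary variable, so after `break` it still holds the last line that was split. The sketch below reproduces this without a file, using a short hypothetical list of lines in place of `random_text.txt`.

```python
# Hypothetical stand-in for the first few lines of the text file
lines = ['Random selection of text',
         'The innermost statements, fd and lt',
         'they are inside the for loop',
         'The next line, square(bob), is flush with']

for line in lines:
    q = line.split()           # q is overwritten on every pass through the loop
    if q[0] == 'they':
        break                  # the loop stops here; q keeps its current value

# q still holds the words of the line that triggered the break
print(q[0], q[3])
```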


#### Now let's try to write our first file:

In [11]:
f = open('my_first_writen_file.txt','w')                # open a file in write mode ( 'w' )

f.write('This is my first written and saved line! \n')  # write something in the file 

f.close()                                               # close the file

#### try to open the file to read it in and see what you just did:

In [12]:
with open('my_first_writen_file.txt','r') as f:
    a = f.read()  

print( a )   # it worked!

This is my first written and saved line! 



#### Now let's go back to this file and open it in read/write mode:

In [13]:
f = open('my_first_writen_file.txt','r+')   # open the file in read/write mode ('r+')

my_text = f.read()                          # read the file into my_text

f.write('oh oh! what did I just do? \n ')   # write something in the file; \n for new line

f.close()                                   # close the file


print(my_text)                              # print my_text

This is my first written and saved line! 



#### let's open it again and see what we did:

In [14]:
f = open('my_first_writen_file.txt','r')
 
my_text = f.read() 

print(my_text)

f.close()  
#it worked! we read a file and wrote something at the end.

This is my first written and saved line! 
oh oh! what did I just do? 
 


#### why do we need the \n ? Let's open the file again and add 2 new lines:

In [15]:
f = open('my_first_writen_file.txt','r+')   # open the file in read/write mode ('r+')

my_text = f.read()                          # read the file into my_text

f.write( 'What would happen if I forget ' )
f.write( 'to add a new line ? ')
f.close()                                   # close the file


### now let's read it in again and see what we did:
f = open('my_first_writen_file.txt','r')
my_text = f.read() 
f.close()  
print( my_text )

This is my first written and saved line! 
oh oh! what did I just do? 
 What would happen if I forget to add a new line ? 


#### oops! I wanted it in 2 lines and it just wrote it in 1 ! Let's try again with the \n :

In [16]:
f = open('my_first_writen_file.txt','r+')   # open the file in read/write mode ('r+')

my_text = f.read()                          # read the file into my_text

f.write( "What would happen if I wouldn't forget \n" )
f.write( 'to add a new line ? \n ')
f.close()                                   # close the file


### now let's read it in again and see what we did:
f = open('my_first_writen_file.txt','r')
my_text = f.read() 
f.close()  
print(my_text)

This is my first written and saved line! 
oh oh! what did I just do? 
 What would happen if I forget to add a new line ? What would happen if I wouldn't forget 
to add a new line ? 
 


#### See what happened? 
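What happened comes down to the file position: `write()` puts text wherever the position currently is and never adds a line break on its own. A minimal sketch, assuming a throwaway file (`position_demo.txt` is a hypothetical name):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'position_demo.txt')  # hypothetical file
with open(path, 'w') as f:
    f.write('line one\n')

with open(path, 'r+') as f:
    pos_start = f.tell()               # 0: a freshly opened 'r+' file starts at the beginning
    f.read()                           # reading moves the position to the end of the file
    f.write('no newline here ')        # written right after the existing text
    f.write('so this continues the same line\n')

with open(path) as f:
    content = f.read()

print(content)
```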

#### How you open a file is crucial. Always check the mode you are using before you run your code.

e.g.,

In [17]:
#Let's open it one more time to write on it again:
f = open('my_first_writen_file.txt','w')
f.write('And I will add this line as well now! \n')
f.close()

# and let's open and read it again:
f = open('my_first_writen_file.txt','r')
my_text = f.read() 
f.close()   

print(my_text) 

And I will add this line as well now! 



### !!!oops! I completely erased the previous text! Can you see why that happened?
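The reason is that `'w'` empties an existing file as soon as it is opened. If you want Python to stop you instead, one option (not used elsewhere in this notebook) is the exclusive-creation mode `'x'`, which refuses to open a file that already exists. A minimal sketch, with a hypothetical file name:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'precious.txt')  # hypothetical file
with open(path, 'w') as f:
    f.write('precious data\n')

try:
    with open(path, 'x') as f:         # 'x' = exclusive creation: fails if the file exists
        f.write('this would have replaced everything\n')
    overwritten = True
except FileExistsError:
    overwritten = False                # the original content is left untouched

with open(path) as f:
    content = f.read()

print(overwritten, repr(content))
```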

In [18]:
#### Let's fall back to the initial file:

f = open('my_first_writen_file.txt','w')      # open a file in write mode ( 'w' )

f.write('This is my first written line! \n')  # write something in the file 
f.write( 'oh oh! what did I just do? \n' )
f.close()                                     # close the file

#### now let's open it again to append a line:

In [19]:
f = open ('my_first_writen_file.txt','a')         # open the file in appending ('a') mode
f.write('and this is the other line I wrote! \n') # write a line
f.close()                                         # close the file

In [20]:
# and let's open and read it again:
f = open('my_first_writen_file.txt','r')
my_text = f.read() 
f.close()   

print(my_text) 

This is my first written line! 
oh oh! what did I just do? 
and this is the other line I wrote! 



# -----------------------

## np.savetxt()

### What about having to deal with data instead of text?

In [21]:
#### Let's create a 3 by 3 data array of 1s

data = np.ones ( ( 3 ,3 ) )


#### How do we write the data in the file?

In [22]:
# one way to do it would be to add it as an appended string:

f = open('my_first_writen_file.txt','a')

for i in range(3):
    f.write(str(data[i,:])+'\n')   #note that f.write() only accepts strings, so you need to convert the data with str();
                                   #if you pass the array itself you will get a TypeError
f.close()


In [23]:
# and let's open and read it again:
f = open('my_first_writen_file.txt','r')
my_text = f.read() 
f.close()   

print( my_text ) 

This is my first written line! 
oh oh! what did I just do? 
and this is the other line I wrote! 
[1. 1. 1.]
[1. 1. 1.]
[1. 1. 1.]



#### what if we want to save the real data as numbers? np.savetxt()

In [24]:
#Let's try it out first on its own:

np.savetxt('my_first_numpy_array_saved.txt', data, 
           fmt='%.2f', delimiter=' ')              #notice that you can use fmt to format your output
                    

#let's check what we did:
f = open('my_first_numpy_array_saved.txt','r')
a = f.read()
f.close()

print( a ) 

1.00 1.00 1.00
1.00 1.00 1.00
1.00 1.00 1.00



In [25]:
#what happens if you want to append the numbers to a text file using savetxt?
# first make a backup copy of my_first_writen_file.txt (trust me)

#try the following. Do you think that it would work? Why/why not?

f = open("my_first_writen_file.txt", "a")
np.savetxt('my_first_writen_file.txt', data, fmt='%.2f', delimiter=' ')
f.close()


#let's see what we did:

f = open('my_first_writen_file.txt','r')
a = f.read()
f.close()

print(a) #!! ouch! it erased your entire file! 

1.00 1.00 1.00
1.00 1.00 1.00
1.00 1.00 1.00



In [26]:
#what you can do (although you will rarely need to) is open your file in binary append mode ('ab')
#and pass the open file, instead of the file name, to np.savetxt:

f=open('my_first_writen_file.txt','ab')

np.savetxt(f,data, fmt='%.2f', delimiter=' ')

f.close()

f = open('my_first_writen_file.txt','r')
a = f.read()
f.close()

print(a) #it worked! 

1.00 1.00 1.00
1.00 1.00 1.00
1.00 1.00 1.00
1.00 1.00 1.00
1.00 1.00 1.00
1.00 1.00 1.00



### numpy.savetxt is an OK method to store your small arrays. For larger sets you need to delve into pickles/csv ...

In [27]:
np.savetxt('test_2.dat',data, fmt='%.2f', delimiter=' ', 
           header= 'These are my random data')
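To see what the `header` argument does to the file, and that it does not get in the way when reading the numbers back, here is a minimal sketch. It assumes a throwaway file (`header_demo.dat` is a hypothetical name) and uses `np.loadtxt`, which with its default settings skips lines starting with `#`.

```python
import os
import tempfile

import numpy as np

data = np.ones((3, 3))
path = os.path.join(tempfile.mkdtemp(), 'header_demo.dat')  # hypothetical file

np.savetxt(path, data, fmt='%.2f', delimiter=' ',
           header='These are my random data')

with open(path) as f:
    first_line = f.readline()     # savetxt writes the header as a '# ...' comment line

loaded = np.loadtxt(path)         # comment lines starting with '#' are skipped by default

print(first_line.strip())
print(loaded.shape)
```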

# ------------------

## Pickles

#### Pickles: the pickle module serializes (pickles) a Python object into a binary byte stream that you can save to a file, and deserializes (unpickles) that stream back into the object when you load it

### “Never unpickle data received from an untrusted or unauthenticated source”

#### pickle module keeps track of the objects it has already serialized, so that later references to the same object won’t be serialized again
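A small sketch of what this object tracking means in practice: if the same list is referenced twice inside a container, it is still a single shared list after unpickling. The example uses `pickle.dumps()` / `pickle.loads()`, the in-memory counterparts of `dump()` / `load()`, so no file is needed. The variable names are made up for illustration.

```python
import pickle

shared = [1, 2, 3]
container = {'a': shared, 'b': shared}    # both keys point to the same list

# round trip through bytes in memory instead of a file
restored = pickle.loads(pickle.dumps(container))

same_object = restored['a'] is restored['b']   # the sharing survives the round trip
restored['a'].append(4)                        # so a change through 'a' shows up under 'b'

print(same_object, restored['b'])
```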


#### Efficient way to store data: you can reconstruct complete Python objects (dictionaries, lists, arrays...) exactly as they were saved

#### ----
Some may be familiar with JSON and pickles may sound comparable but keep in mind:

Pickles are not human readable

Pickles are Python specific
#### ----


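#### When trying to unpickle a dataset the version of Python used may come into play: a pickle written with the default settings of Python 3 cannot be read by Python 2, and a pickle written by Python 2 may need extra options (e.g. an encoding) to load in Python 3!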
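Version compatibility is governed by the pickle protocol number, which you can choose with the `protocol` argument of `pickle.dump()` / `pickle.dumps()`. A minimal sketch, assuming you want a pickle that older interpreters can read (the dictionary is just sample data):

```python
import pickle

record = {'model': 'Escape', 'make': 'Ford'}

# protocol 2 is the newest protocol that Python 2 can read
old_style = pickle.dumps(record, protocol=2)

# the newest protocol this interpreter supports (more compact, less portable)
newest = pickle.dumps(record, protocol=pickle.HIGHEST_PROTOCOL)

print(pickle.HIGHEST_PROTOCOL, pickle.DEFAULT_PROTOCOL)
print(pickle.loads(old_style) == pickle.loads(newest) == record)
```

The same `protocol=` keyword works when writing to a file with `pickle.dump(obj, f, protocol=2)`.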


#### Things to remember:
 
import pickle <br>
pickle.dump() <br>
pickle.load() <br>


In [28]:
#Pickles

import pickle


In [29]:

#Let's get a dictionary from our previous demo :

car1 = {
  "model" : "Escape",
  "make" : "Ford"
}
car2 = {
  "model" : "500",
  "make" : "Fiat"
}
car3 = {
  "model" : "Tucson",
  "make" : "Hyundai"
}


all_my_cars = {
  "car1" : car1,
  "car2" : car2,
  "car3" : car3
}


In [30]:
#and now let's open a file to save the data in:

f = open('my_first_pickle.pickle','wb')       # open file in binary mode 

pickle.dump(all_my_cars,f)                    # let's dump our pickled dictionary in there:

f.close()                                     # close the file

##### congrats! you saved your first pickle!

In [31]:
#Let's now see what we saved actually:

with open('my_first_pickle.pickle','rb') as pickle_read:  # open the pickle in binary read mode
    example_dict = pickle.load(pickle_read)               # load the pickle in example_dict

In [32]:
# test:

print( example_dict.items()  )
print( example_dict.keys()   )
print( example_dict.values() )


example_dict['car1']['model']

#it worked!

dict_items([('car1', {'model': 'Escape', 'make': 'Ford'}), ('car2', {'model': '500', 'make': 'Fiat'}), ('car3', {'model': 'Tucson', 'make': 'Hyundai'})])
dict_keys(['car1', 'car2', 'car3'])
dict_values([{'model': 'Escape', 'make': 'Ford'}, {'model': '500', 'make': 'Fiat'}, {'model': 'Tucson', 'make': 'Hyundai'}])


'Escape'

#### Saving the dictionary in a pickle preserved its dictionary nature.

#### Sometimes you know a priori what data a pickle has in it (how many columns, what each column is...)

In [33]:
# let's make some data up
f_in = np.arange(10)
q_in = f_in**2
u_in = f_in**3

# open a file and store your second pickled data:

f = open('my_second_pickle.pickle','wb')

pickle.dump([f_in, q_in, u_in],f)

f.close()

In [34]:
# now that I know that my data are f_in, q_in, u_in let's unpickle it in one go:

with open( 'my_second_pickle.pickle','rb' ) as pickle_in:   # the with block closes the file for us
    f, q, u = pickle.load( pickle_in )


In [35]:
# test:
print( f == f_in )
print( q == q_in )

[ True  True  True  True  True  True  True  True  True  True]
[ True  True  True  True  True  True  True  True  True  True]


# ------------------

## CSV

#### CSV: Comma-Separated Values --> used to save tabular data such as a spreadsheet or a database table


#### Things to remember:

import csv  <br>
csv.reader() <br>
csv.writer()<br>


In [36]:
#Let's now do some csv reading/writing:

import csv 

In [37]:
## as a test case we will use the dictionary from before:

with open('my_first_csv.csv', 'w') as f:     # open the file to write in it

    writer = csv.writer(f)                   # you will use the csv module to write the data
    
    for key, value in example_dict.items():  # loop over items in your dictionary 
        
        writer.writerow([key, value])        # write the items

In [38]:
#open the file from your Jupyter notebook tree: it is human readable (unlike the pickle...)

#let's read it back in:

with open('my_first_csv.csv') as f:
    reader = csv.reader(f)
    my_csved_dict = dict(reader)

    
print(my_csved_dict.items())

print(type(my_csved_dict))



ValueError: dictionary update sequence element #1 has length 0; 2 is required

In [39]:
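#note though that in this example we have lost the structure of the nested dictionary:
#csv stores each inner dictionary as plain text, so my_csved_dict['car1']['make'] will give you an error
#(here a NameError, because the cell above failed before my_csved_dict was created):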

print( my_csved_dict['car1']['make'] )

NameError: name 'my_csved_dict' is not defined
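The ValueError above most likely comes from blank rows: on some platforms (notably Windows) a file opened for `csv.writer` without `newline=''` gets an empty line after every row, and `dict()` cannot turn an empty row into a key/value pair. A minimal sketch of one possible fix follows. The temporary file and the use of `ast.literal_eval` to rebuild the inner dictionaries are illustrative choices, not part of the original notebook.

```python
import ast
import csv
import os
import tempfile

all_my_cars = {
    'car1': {'model': 'Escape', 'make': 'Ford'},
    'car2': {'model': '500', 'make': 'Fiat'},
    'car3': {'model': 'Tucson', 'make': 'Hyundai'},
}

path = os.path.join(tempfile.mkdtemp(), 'cars_demo.csv')  # hypothetical file

with open(path, 'w', newline='') as f:      # newline='' is what the csv module expects
    writer = csv.writer(f)
    for key, value in all_my_cars.items():
        writer.writerow([key, value])       # each inner dict is stored as its text form

with open(path, newline='') as f:
    flat = dict(csv.reader(f))              # values come back as plain strings

# ast.literal_eval turns the text of a Python literal back into an object
nested = {key: ast.literal_eval(value) for key, value in flat.items()}

print(type(flat['car1']), nested['car1']['make'])
```

For data with nested structure, a format such as a pickle keeps the nesting without this extra step.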

# ------------------

## Pandas

#### A great tool for data analysis and modeling of large datasets (think ML sizes...). It’s a software library that can read large amounts of data and analyze them fast


#### Can read in CSV files and SQL databases and create a Python object with rows and columns (a DataFrame) out of them; this makes working on such data faster than using tuples/dictionaries





In [40]:
## Last but not least, let's try Pandas out
import pandas as pd 

In [41]:
# we will use the pickled data from above here:

pd.read_pickle('my_first_pickle.pickle')  # reads it in and shows you the nested dictionaries


{'car1': {'model': 'Escape', 'make': 'Ford'},
 'car2': {'model': '500', 'make': 'Fiat'},
 'car3': {'model': 'Tucson', 'make': 'Hyundai'}}

In [42]:
pd.read_csv('my_first_csv.csv')           # reads it in and shows you the table

   car1     {'model': 'Escape', 'make': 'Ford'}
0  car2        {'model': '500', 'make': 'Fiat'}
1  car3  {'model': 'Tucson', 'make': 'Hyundai'}


In [43]:
#Now let's make a pandas DataFrame out of the pickle:

pd.DataFrame(pd.read_pickle('my_first_pickle.pickle'))


         car1  car2     car3
model  Escape   500   Tucson
make     Ford  Fiat  Hyundai


In [44]:
# let's read in a dataframe the second pickle with the random data

pd.DataFrame( pd.read_pickle('my_second_pickle.pickle') )  #each stored array becomes one row of the DataFrame:

   0  1  2   3   4    5    6    7    8    9
0  0  1  2   3   4    5    6    7    8    9
1  0  1  4   9  16   25   36   49   64   81
2  0  1  8  27  64  125  216  343  512  729


In [45]:
# read the data in a dataframe named df:
df = pd.DataFrame(pd.read_pickle('my_second_pickle.pickle')) 

#ask it to describe() your data
df.describe()  # summary statistics for numerical columns ; not very useful here but 
               # imagine what you can do with LARGE datasets!

         0    1         2          3          4          5           6           7           8           9
count  3.0  3.0       3.0        3.0        3.0        3.0         3.0         3.0         3.0         3.0
mean   0.0  1.0  4.666667       13.0       28.0  51.666667        86.0       133.0  194.666667       273.0
std    0.0  0.0   3.05505  12.489996  31.749016  64.291005  113.578167  183.073756   276.24144  396.545079
min    0.0  1.0       2.0        3.0        4.0        5.0         6.0         7.0         8.0         9.0
25%    0.0  1.0       3.0        6.0       10.0       15.0        21.0        28.0        36.0        45.0
50%    0.0  1.0       4.0        9.0       16.0       25.0        36.0        49.0        64.0        81.0
75%    0.0  1.0       6.0       18.0       40.0       75.0       126.0       196.0       288.0       405.0
max    0.0  1.0       8.0       27.0       64.0      125.0       216.0       343.0       512.0       729.0


In [46]:
#get the mean of all columns:
print( df.mean() )

0      0.000000
1      1.000000
2      4.666667
3     13.000000
4     28.000000
5     51.666667
6     86.000000
7    133.000000
8    194.666667
9    273.000000
dtype: float64
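In the frame above each stored array ended up as a row, so `describe()` and `mean()` summarize across the three arrays at each position. If you would rather have one column per quantity, one option is to build the DataFrame from a dictionary. This is a minimal sketch, and the column labels `'f'`, `'q'`, `'u'` are names chosen here for illustration.

```python
import numpy as np
import pandas as pd

f_in = np.arange(10)

# one named column per array, one row per element
df_cols = pd.DataFrame({'f': f_in, 'q': f_in**2, 'u': f_in**3})

summary = df_cols.describe()     # statistics are now computed per quantity

print(df_cols.shape)
print(summary.loc['mean'])
```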


In [47]:
#you can get correlations between columns: 
print( df.corr() )

#max (& min) of each column:
#df.max()   #& df.min()

#get the standard deviation of each column:
#df.std()

#and much much more...   If you are interested in playing around with big data there's plenty 
#of Open Access databases

    0   1         2         3         4         5         6         7  \
0 NaN NaN       NaN       NaN       NaN       NaN       NaN       NaN   
1 NaN NaN       NaN       NaN       NaN       NaN       NaN       NaN   
2 NaN NaN  1.000000  0.995871  0.989743  0.984324  0.979864  0.976221   
3 NaN NaN  0.995871  1.000000  0.998625  0.996271  0.993944  0.991870   
4 NaN NaN  0.989743  0.998625  1.000000  0.999424  0.998337  0.997176   
5 NaN NaN  0.984324  0.996271  0.999424  1.000000  0.999719  0.999151   
6 NaN NaN  0.979864  0.993944  0.998337  0.999719  1.000000  0.999847   
7 NaN NaN  0.976221  0.991870  0.997176  0.999151  0.999847  1.000000   
8 NaN NaN  0.973223  0.990072  0.996078  0.998508  0.999522  0.999910   
9 NaN NaN  0.970725  0.988522  0.995082  0.997871  0.999137  0.999711   

          8         9  
0       NaN       NaN  
1       NaN       NaN  
2  0.973223  0.970725  
3  0.990072  0.988522  
4  0.996078  0.995082  
5  0.998508  0.997871  
6  0.999522  0.999137  
7  0

### ---- PRACTICUM:

## let's do some practice; write to a file using numpy:

#### Make a 5 by 5 numpy array of zeros. Set x equal to a numpy range from 5 to 10, and y to the exponential of x. Populate your 5 by 5 array with the product of x * y (double for loop). Save the 5 by 5 array to a file named my_random_data.dat. Add a header like "These are my random data" and format the output to have 4 decimal places. 

In [48]:
my_random_data = np.zeros( ( 5, 5 ) )

x = np.arange( 5, 10 )
y = np.exp( x )

for i in range( 5 ):
    for j in range( 5 ):
        my_random_data[ i, j ] = x[ i ] * y[ j ]
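The exercise asks for a double for loop, but it is worth knowing that the same table can be built without one. A minimal sketch comparing the loop with `np.outer`, which computes every product `x[i] * y[j]` in a single call (the names `looped` and `vectorized` are chosen for this example):

```python
import numpy as np

x = np.arange(5, 10)
y = np.exp(x)

# the double loop from the exercise
looped = np.zeros((5, 5))
for i in range(5):
    for j in range(5):
        looped[i, j] = x[i] * y[j]

# the same table in one call: element [i, j] is x[i] * y[j]
vectorized = np.outer(x, y)

print(np.allclose(looped, vectorized))
```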

In [49]:
# let's save it with np.savetxt() and add a header with some info:

np.savetxt( 'my_random_data.dat' , my_random_data, header = 'These are my random data', 
          fmt = '%.4f' )

In [None]:
#check it out

In [50]:
tst_data = np.genfromtxt( 'my_random_data.dat', comments='#' )

In [51]:
print( tst_data )

[[  742.0658  2017.144   5483.1658 14904.7899 40515.4196]
 [  890.479   2420.5728  6579.799  17885.7479 48618.5036]
 [ 1038.8921  2824.0016  7676.4321 20866.7059 56721.5875]
 [ 1187.3053  3227.4303  8773.0653 23847.6639 64824.6714]
 [ 1335.7184  3630.8591  9869.6984 26828.6219 72927.7553]]


### now write the same data but to a csv file:

In [52]:
with open('my_random_data.csv', 'w', newline='') as f:   # open the file to write in it (newline='' avoids blank rows)

    writer = csv.writer(f)                   # you will use the csv module to write the data
    
    for i in range( len( x ) ) :             # loop over the rows of the array
        
        writer.writerow( my_random_data[i, : ] )        # write one row of the array per line

In [53]:
#let's open to read the first 3 lines of file emma.txt;

f = open('emma.txt', 'r')

q = f.readline( ).split( )
l = f.readline( ).split( )
m = f.readline( ).split( )

f.close()



In [54]:
print( q )

['******The', 'Project', 'Gutenberg', 'Etext', 'of', 'Emma,', 'by', 'Jane', 'Austen******']


In [55]:
print( l )
print( m )

[]
['Please', 'take', 'a', 'look', 'at', 'the', 'important', 'information', 'in', 'this', 'header.']


In [56]:
f = open('emma.txt', 'r')

n = f.read().split()

f.close()

In [57]:
n.count('the')

4862

In [58]:
n.count('Emma')

481
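Keep in mind that `split()` followed by `count()` only finds exact matches: 'The', 'the' and 'the,' are three different words to it. If you want a count that ignores capitalization and punctuation, one possible approach is sketched below on a short made-up sentence (not the text of emma.txt), using `collections.Counter`.

```python
from collections import Counter

# hypothetical sample text standing in for the contents of a file
text = 'The cat saw the dog. The dog saw Emma, and Emma saw the cat.'

words = text.split()
exact = words.count('the')       # same approach as above: exact matches only

# lower-case each word and strip surrounding punctuation before counting
cleaned = [w.strip('.,;:!?"\'()*').lower() for w in words]
counts = Counter(cleaned)

print(exact, counts['the'], counts['emma'])
```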

### Let's make boxes with sides x, y, z and calculate their volume. Create array x that is a range from 0.1 to 10 with a step of 0.2. Make array y that is a linspace from 0.1 to 10 in 50 steps. Then make array z that is equal to $x^2 + y^{0.3}$. Calculate the volume of the boxes [assume it is in cu ft] for every possible combination of x, y, z. What is the maximum, minimum and mean volume of our boxes in cu ft? If I need boxes with a volume of more than 8000 cu ft, how many boxes will I have?
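One possible way to approach this exercise is sketched below. It assumes that "every possible combination" means every triple (x[i], y[j], z[k]), and it uses numpy broadcasting instead of three nested loops. Treat it as a sketch to compare against your own solution rather than the reference answer.

```python
import numpy as np

x = np.arange(0.1, 10, 0.2)      # 0.1, 0.3, ..., 9.9
y = np.linspace(0.1, 10, 50)     # 50 evenly spaced values from 0.1 to 10
z = x**2 + y**0.3                # element by element, so x and y must have equal length

# broadcasting builds a 3-D array with volumes[i, j, k] = x[i] * y[j] * z[k]
volumes = x[:, None, None] * y[None, :, None] * z[None, None, :]

n_big = np.count_nonzero(volumes > 8000)   # boxes larger than 8000 cu ft

print(volumes.shape)
print(volumes.max(), volumes.min(), volumes.mean())
print(n_big)
```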

### Make a function that gets as input any string, splits it into the words it is made of and returns a tuple with the shortest and longest word. Call it for the sentence: "If you are trying to come up with a new concept, a new idea or a new product, a random sentence may help you find unique qualities you may not have considered"

In [59]:
def split_sentence(sentence):
    """Splits a string into separate words and returns a tuple of the shortest and longest words."""
    split_string = sentence.split()                        # list of words
    lengths = []
    for i in split_string:
        lengths.append(len(i))                             # length of each word
    shortest = split_string[lengths.index(min(lengths))]   # first word with the minimum length
    longest = split_string[lengths.index(max(lengths))]    # first word with the maximum length
    short_long = (shortest, longest)
    return short_long

In [61]:
print(split_sentence('If you are trying to come up with a new concept, a new idea or a new product, a random sentence may help you find unique qualities you may not have considered'))

('a', 'considered')
