# Molecular Modelling Exercises

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE".

---

# Working with files

Let's create a list of all numbers from 1 to 100 together with their squares and store it in a file.

In [1]:
# create the list
squares = []
for i in range(1,101):
    squares.append([i, i**2]) # append the pair [number, square] to squares

squares[-10:] # display the last 10 values

[[91, 8281],
 [92, 8464],
 [93, 8649],
 [94, 8836],
 [95, 9025],
 [96, 9216],
 [97, 9409],
 [98, 9604],
 [99, 9801],
 [100, 10000]]

In [2]:
# create file squares.csv and open it in `w`rite mode. if this file already exists, it will be overwritten!
file = open('squares.csv', 'w') 
for square in squares:
    # this creates a string of all elements in square connected by a ','
    file_text = ', '.join(str(x) for x in square) 
    # we add a newline (\n) after each line
    file.write(file_text + '\n') 

# when we are done, we need to close the file
file.close() 

You can now look in your file explorer and see that you indeed created the file `squares.csv`.
The problem with this method is that if you forget the `file.close()` at the end, or your code crashes before reaching it, the file stays open.
Buffered data may then not be written to disk, and on some systems (notably Windows) other programs may be unable to modify or delete the file until the Python process ends.

To handle this more elegantly, we can use the `with` statement, which makes sure that the file is closed no matter what (the object that makes this possible is called a context manager). The file stays open only while the code indented below the `with` line is being executed.
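As a minimal sketch, the writing step from above could be rewritten with `with` as follows. It rebuilds `squares` with a list comprehension so the cell runs on its own, and it overwrites `squares.csv` with the same content as before.

```python
# build the same list of [number, square] pairs as above
squares = [[i, i**2] for i in range(1, 101)]

# the file is closed automatically when the indented block ends,
# even if an exception is raised inside it
with open('squares.csv', 'w') as file:
    for square in squares:
        file.write(', '.join(str(x) for x in square) + '\n')

print(file.closed)  # True: no explicit file.close() needed
```

The `closed` attribute of the file object shows that the file was closed as soon as the block ended.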

Now let's try to read the data from the file we just wrote to.

In [3]:
# open file squares.csv in 'r'ead mode
with open('squares.csv', 'r') as file:
    # we can read all the lines into a list
    data_from_file = file.readlines()

data_from_file[0], type(data_from_file[0])

('1, 1\n', str)

As we can see, the data read from the file is a string. The number `i` and its square are separated by a comma, and the line ends with `\n`, the newline character we added when writing the file. If we hadn't added it, all the numbers would be on one single line and we would have a harder time restoring our data to the desired format.

Before we can do anything useful with the data stored in the text file, we need to clean it up by:
 - splitting the line at `','` so we have the two numbers separately
 - removing excess white space (newline characters `\n`, tabs `\t`, spaces ` `, ..)
 - converting to a datatype that is useful for us (in this case `int`)

In [4]:
clean_data = []
for entry in data_from_file:
    # with the str.split() method we can split the string at each occurrence of the provided string
    # ',' in this case
    i, i_square = entry.split(',')
    
    # with the str.strip() method we can remove excess white space
    i_clean = i.strip()
    i_square_clean = i_square.strip()
    
    # we convert the strings to integers and store them together in a list
    clean_values = [int(i_clean), int(i_square_clean)]
    
    # we append this list to our clean data list
    clean_data.append(clean_values)

In [5]:
clean_data == squares

True

As we can see, the data in the original `squares` and in the read-in `clean_data` are equal. So we successfully wrote data to a file and read it back in!
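As an alternative to splitting and stripping by hand, the standard library's `csv` module can do the splitting. The sketch below is not part of the exercise; it uses `io.StringIO` with three hypothetical sample lines as a stand-in for the open file, so it runs without `squares.csv`.

```python
import csv
import io

# stand-in for an open file: three lines in the format written above
sample = io.StringIO('1, 1\n2, 4\n3, 9\n')

# skipinitialspace=True drops the space that follows each comma
reader = csv.reader(sample, skipinitialspace=True)
clean_rows = [[int(value) for value in row] for row in reader]

print(clean_rows)  # [[1, 1], [2, 4], [3, 9]]
```

With a real file, the `StringIO` object would be replaced by the file object from `open(...)`.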

## Exercise 1

Read in data from `exercise1.csv` and calculate the:
- sum of all values and store it in the value `sum_of_values`
- lowest value and store it in `lowest`
- highest value and store it in `highest`

In [17]:
values = []
with open('exercise1.csv', 'r') as file:
    # YOUR CODE HERE
    data_from_file = file.readlines()

for entry in data_from_file:
    entry_clean = entry.strip()
    entry_clean = float(entry_clean)
    values.append(entry_clean)

print(values)

# the cleaned numbers are stored in `values`
sum_of_values = sum(values)
print(sum_of_values)

lowest = min(values)

highest = max(values)

[-84.0, -1.5857864376269049, -14.267949192431123, -3.0, -60.763932022500214, -68.55051025721683, -5.354248688935409, -72.17157287525382, -33.0, -36.83772233983162, -77.6833752096446, -71.53589838486225, 6.60555127546399, -59.258342613226056, -14.127016653792584, 17.0, -34.87689437438234, -55.757359312880716, -76.64110105645932, -32.52786404500042, 15.582575694955839, -3.3095842401765703, -22.20416847668728, -24.101020514433642, -53.0, -7.9009804864072155, 1.196152422706632, 29.29150262212918, -64.6148351928655, 19.477225575051662, -21.432235637169978, -22.34314575050762, 11.744562646538029, 27.8309518948453, 15.916079783099615, -29.0, 26.08276253029822, 2.164414002968976, -45.7550020016016, 19.32455532033676, -37.59687576256715, -45.51925930159214, 43.557438524302, -20.3667504192892, 10.70820393249937, -2.217670016874732, 52.855654600401046, -22.071796769724493, -21.0, 22.071067811865476, -32.85857157145715, -18.78889744907202, 40.28010988928052, -9.651530771650465, -13.583801512904337

In [18]:
assert len(values) == 4000
assert all(isinstance(val, float) for val in values)
assert lowest < highest

# Extra
## Handling huge files

If you have huge files, the data may not fit into memory, so the above procedure can lead to memory errors and computer crashes. In these cases, you may want to process the data lazily, requesting the next chunk of data only when you are ready to handle it.

The code below takes quite some time to execute, because it writes and then reads a file with 100 million lines.

In [19]:
with open('huge.csv', 'w') as huge_file:
    huge_file.writelines(f'{x}\n' for x in range(100_000_000))

In [31]:
with open('huge.csv', 'r') as huge_file:
    lines = (line.strip() for line in huge_file) # lazy file loading through generator expressions
    lines = (line for line in lines if line) # skip empty lines
    lines = (int(line) for line in lines)
    # further processing if required
    print(sum(lines))

4999999950000000


In [28]:
def process(file):
    for line in file:
        line = line.strip()
        if line:
            yield int(line)

In [29]:
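with open('huge.csv', 'r') as f:
    # consume the generator one value at a time;
    # list(process(f)) would load all values into memory at once
    print(sum(process(f)))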
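To try the generator approach without creating a 100-million-line file, here is a small sketch that runs `process` on an in-memory stand-in. The `io.StringIO` object and its four sample lines are assumptions made for illustration; `process` is repeated so the cell runs on its own.

```python
import io

def process(file):
    # yield one integer per non-empty line, reading lines only on demand
    for line in file:
        line = line.strip()
        if line:
            yield int(line)

# stand-in for a file: three numbers and one empty line
small_file = io.StringIO('1\n2\n\n3\n')

numbers = process(small_file)  # nothing has been read yet
first = next(numbers)          # reads only the first line
total_rest = sum(numbers)      # consumes the remaining lines

print(first, total_rest)  # 1 5
```

Because each value is produced only when requested, memory use stays small no matter how many lines the file has.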