# Interacting with, reading, and writing files in Python
## Contents
- basic access
- csvs
- regex

In [None]:
import numpy
import pandas
import os

***
***

## Basic Access

Files can be opened with the built-in function `open`. We must specify what we want to do with that file, e.g. read or write.

In [None]:
help(open)

### Reading a file:
1. Open the file
2. Read its lines
3. Parse whatever you want from those lines
4. Close it

In [None]:
# open it with read permissions
myfile = open('example.txt', 'r')

In [None]:
# read it
lines = myfile.readlines()

In [None]:
# close it
myfile.close()

__Note__: for very large files, or ones in which the information you care about is at the start, see `readline` to read one line at a time instead of loading the whole file into memory.
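
A minimal sketch of reading one line at a time, using a `with` block so the file is closed automatically. The file name `demo_lines.txt` and its contents are made up for illustration and live in a temporary directory.

```python
import os
import tempfile

# hypothetical demo file, created only so the example is self-contained
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, 'demo_lines.txt')
with open(path, 'w') as f:
    f.write('header\nfirst\nsecond\n')

# the with-block closes the file for us, even if an error occurs
with open(path, 'r') as f:
    header = f.readline()        # reads only the first line
    remaining = []
    for line in f:               # iterating yields one line at a time
        remaining.append(line.rstrip('\n'))

# clean up the temporary file and directory
os.remove(path)
os.rmdir(tmpdir)
```

Only one line is held in memory at a time inside the loop, which is what makes this approach suitable for large files.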

In [None]:
print(f'Number of lines read: {len(lines)}')

In [None]:
lines

> It is a list of lines! The `\n` at the end of each string is the newline character, which marks a line break in the file (an "enter").

In [None]:
for line in lines:
    print(line)
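
Because each line keeps its trailing `\n` and `print` adds one of its own, the loop above shows a blank line after every entry. A small sketch of two ways to deal with that, using a made-up list in place of `lines`:

```python
# hypothetical stand-in for the list returned by readlines()
demo_lines = ['first line\n', 'second line\n', 'last line']

# option 1: strip the trailing newline from each entry
stripped = [line.rstrip('\n') for line in demo_lines]

# option 2: keep the newline and tell print not to add its own
for line in demo_lines:
    print(line, end='')
```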

***
__Let's get some data from it.__

In [None]:
found_data = False
data = []
for line in lines:
    # start counting when we find data
    if line.startswith('Below is'):
        found_data = True
        continue                    # skip this line
    elif line.startswith('End of data'):
        break                       # we are done, break out of loop
    
    # once we've found data start recording
    if found_data:
        data.append(line.split()[0])
data = numpy.array(data).astype(int)

In [None]:
data

***
__A more complex case__

In [None]:
def load_xyz(filename):
    """Reads information from an xyz file.
    input : str, path to xyz
    returns: 
        list : atom symbols
        array : atom positions
    """
    # isinstance returns False rather than raising, so check it explicitly
    if not isinstance(filename, str):
        raise TypeError("Missing filename - data can't be loaded!!")
    # the with-block closes the file even if reading fails
    with open(filename, "r") as xfile:
        lists = [line.strip().split() for line in xfile]
    
    natoms_file=int(lists[0][0])
    natoms_parsed=len(lists)-2

    assert natoms_file == natoms_parsed,\
        "Atom number in xyz file does not match number of atoms!"

    atom_name_list=[]
    coords_list=[]

    for i in range(2, int(natoms_file)+2):
        atom_name_list.append(lists[i][0])
        coords_tmp = [float(item) for item in lists[i][-3:]]
        coords_list.append((coords_tmp))
    
    positions = numpy.array(coords_list)
    
    return atom_name_list, positions

In [None]:
atoms, positions = load_xyz('example.xyz')

In [None]:
atoms

In [None]:
positions
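
`load_xyz` assumes the usual xyz layout: the atom count on the first line, a comment on the second, then one `symbol x y z` row per atom. A minimal sketch of that layout, parsed from a string with plain Python. The water-like geometry below is made up for illustration and is not the content of `example.xyz`.

```python
# hypothetical xyz content; the coordinates are illustrative only
xyz_text = """3
made-up water molecule
O 0.000 0.000 0.000
H 0.757 0.586 0.000
H -0.757 0.586 0.000
"""

rows = [line.split() for line in xyz_text.splitlines()]
natoms = int(rows[0][0])                 # first line: number of atoms
atom_rows = rows[2:2 + natoms]           # skip the count and comment lines

symbols = [row[0] for row in atom_rows]
coords = [[float(value) for value in row[-3:]] for row in atom_rows]
```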

### Writing to a file:
1. Open it
2. Write your lines
3. Close it

In [None]:
myfile = open('myfile.dat', 'w')

In [None]:
for datapoint in data:
    # files can only contain strings, so we have to convert
    myfile.write(str(datapoint))

In [None]:
myfile.close()

> It just pasted them all on one line, when we probably want to store each datapoint separately!

In [None]:
os.remove('myfile.dat')

In [None]:
myfile = open('myfile.dat', 'w')

In [None]:
for datapoint in data:
    # files can only contain strings, so we have to convert
    myfile.write(str(datapoint))
    myfile.write('\n')

In [None]:
myfile.close()

In [None]:
os.remove('myfile.dat')
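
The open/write/close steps can also be sketched with a `with` block, which closes the file on exit so the explicit `close()` call is not needed. `demo_data` and the file name `demo_out.dat` are made up for illustration.

```python
import os
import tempfile

demo_data = [1, 2, 5, 1]   # hypothetical stand-in for `data`

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, 'demo_out.dat')

# write one datapoint per line; the file is closed when the block ends
with open(path, 'w') as f:
    for datapoint in demo_data:
        f.write(f'{datapoint}\n')

# read the file back to check what was stored
with open(path) as f:
    written = f.read()

# clean up the temporary file and directory
os.remove(path)
os.rmdir(tmpdir)
```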

***
***

## Pandas and csv files

Pandas is a Python library created for handling data, and it is a very useful tool for that purpose. Say we have a number of related variables and would like to store the datapoints for those variables in an Excel-like table; pandas calls this a `DataFrame`.

In [None]:
df = pandas.DataFrame({
    'x': data
})

In [None]:
df 

Add another variable.

In [None]:
df['y'] = df['x'].apply(lambda x: x**2+1)

In [None]:
df

> We have an x and a y variable related to each other, and we have 7 datapoints. A dataframe can be as long and as wide as we want.

### Tangent: Pandas can do a ton of cool shit

We can get stats on our data

In [None]:
df.describe()

We can plot it

In [None]:
df.plot.scatter(x='x', y='y')

We can bin it

In [None]:
df.groupby(
    pandas.cut(df['y'], bins=5)
).count()

And on and on...

### Back on track: save our data as csv, load a csv
It's super easy.

In [None]:
df.to_csv('mydata.csv')

In [None]:
another_df = pandas.read_csv('mydata.csv', index_col=0)
another_df

***
## Regex: find patterns in text

In [None]:
import re

In [None]:
toy_string = """
Stuff I don't care about
------------------------
>> Some more things
None
Of
This
Matters

| Data |
    1
    2
    5
    1

------------------------
Other Things
>> Don't care
Total value: 20.1

Thank you for looking at this string
End of file.
"""

We can extract things from this string by defining patterns.

In [None]:
value_pattern = 'Total value: (.*)'

- The first part of the pattern is a literal: the pattern only matches where the string contains exactly those characters
- The parentheses form a capture group, meaning "give me what's inside"
- The `.` matches any character except a new line and the `*` means repeat zero or more times, so `.*` means everything until the end of the line

In [None]:
re.findall(
    value_pattern,
    toy_string
)
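
A small sketch of how the capture group changes what `re.findall` returns, and of converting the captured text to a number. `demo_text` is a shortened, made-up stand-in for `toy_string`.

```python
import re

demo_text = "Other Things\nTotal value: 20.1\n\nEnd of file.\n"

# with a group, findall returns only the captured part
with_group = re.findall(r'Total value: (.*)', demo_text)

# without a group, findall returns the whole match
without_group = re.findall(r'Total value: .*', demo_text)

# the captured text is a string, so convert it before doing arithmetic
total = float(with_group[0])
```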

In [None]:
# raw string (r'...') so the backslashes reach the regex engine unchanged
data_pattern = r'\| Data \|([\S\s]*?)(?=\n\n)'

In [None]:
out = re.findall(
    data_pattern,
    toy_string
)[0]

In [None]:
out.split()

- `\| Data \|` is literal; the `\` is needed because `|` is a special character, and we want to look for the actual `|` characters in the string
- `([\S\s]*?)` means match everything __including new lines__. The `?` at the end means stop the __first time__ the next pattern matches; without it, the group would capture everything up to __the last__ match of the next pattern
- `(?=\n\n)` is a lookahead: the match must be __followed by__ two new lines, which are not themselves included in the match
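
A sketch of the difference the `?` makes, using a made-up string that contains two blank-line separators:

```python
import re

# hypothetical text with two places where "\n\n" occurs
demo_text = "| Data |\n 1\n 2\n\nnotes\n\nend"

# lazy: stops at the first place followed by two new lines
lazy = re.findall(r'\| Data \|([\S\s]*?)(?=\n\n)', demo_text)

# greedy: runs on to the last place followed by two new lines
greedy = re.findall(r'\| Data \|([\S\s]*)(?=\n\n)', demo_text)

lazy_items = lazy[0].split()
greedy_items = greedy[0].split()
```

The lazy version captures only the data block, while the greedy version also swallows the text after the first blank line.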

> Regex is basically its own language, so there is a lot to learn, but it can be very helpful.