# Section 2 Reading and Parsing Data From a File

So far, you have been working with very short lists of data or generating your own lists of data. However, in data-driven science, you'll likely have to work with large sets of data that can come in a variety of formats: comma-separated values, tab-separated values, and space-separated values, to name a few. Individual quantities can also come in unusual formats depending on what they represent. For example, a list of times might be formatted in hours, minutes and seconds, like this: 10:03:20. Whatever format your data might come in, you must be able to read in that data and separate it into more easily processed components that you can then analyze. That's exactly what parsing data means.

## 2.1 Reading Data From a File

The first step is to learn how to use Python to directly read data from a file. Let's say we have a data file called `planet_data.txt` containing the solar system planet names and their masses, that looks something like this:

| Planet name        | Mass ($10^{24}$ kg)  |
| :----------------- | :------- |
| __Mercury__       |   0.330  |
| __Venus__         | 4.87  |
| __Earth__         | 5.97  |
| __Mars__          | 0.642  |
| __Jupiter__       | 1898 |
| __Saturn__        | 568  |
| __Uranus__        | 86.8  |
| __Neptune__       | 102 |


To open and read this data file, we use the commands:

```python
planet_file = open('planet_data.txt')
data = planet_file.read()
```

Here, `planet_file` is a file object, which is a kind of object that allows you to access and manipulate a file on your computer. Once the file object is created (the first line of code), you can use it to reference the file, and access or manipulate it with functions such as `read`, `readline`, `readlines`, `write`, `seek`, and `close`. The second line reads the entire contents of the file into a single string. Notice that the first row of the file is a comment explaining the meanings of the columns, which is helpful to anyone looking at the file.
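
As a minimal sketch of how three of these functions differ, the cell below writes a tiny two-line scratch file and reads it back in three ways. The file name `scratch_demo.txt` and its contents are made up for illustration and are not part of the course data. The `'w'` argument opens the file for writing.

```python
import os
import tempfile

# Build a small scratch file in a temporary folder (hypothetical name and contents)
demo_path = os.path.join(tempfile.mkdtemp(), 'scratch_demo.txt')
demo_file = open(demo_path, 'w')
demo_file.write('first line\nsecond line\n')
demo_file.close()

demo_file = open(demo_path)
whole_text = demo_file.read()        # the entire file as one string
demo_file.seek(0)                    # jump back to the start of the file
one_line = demo_file.readline()      # only the next line, newline included
rest_lines = demo_file.readlines()   # every remaining line, as a list of strings
demo_file.close()

print(repr(whole_text))
print(repr(one_line))
print(rest_lines)
```

Because `readline` has already consumed the first line, `readlines` returns only what is left in the file.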


In [None]:
# Running in Google Colab? Run this cell
!wget https://raw.githubusercontent.com/CIERA-Northwestern/REACHpy/main/Module_3/data/planet_data.txt

In [None]:
planet_file = open('planet_data.txt')
data = planet_file.read()

print(data)

When we open the file this way, Python opens it for reading by default. Sometimes you might instead want to write or append to the file. You can specify which kind of access you want when you open a file by adding `'r'`, `'w'`, or `'a'` (read, write, and append) to the open command, like this:
```python
planet_file = open('planet_data.txt', 'r')
```
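
To see the three options side by side, here is a small sketch that writes to, appends to, and then reads a scratch file. The file name `notes_demo.txt` is hypothetical and was chosen so that the real planet data is never overwritten.

```python
import os
import tempfile

# Hypothetical scratch file in a temporary folder
notes_path = os.path.join(tempfile.mkdtemp(), 'notes_demo.txt')

notes_file = open(notes_path, 'w')   # 'w' creates the file, or erases an existing one
notes_file.write('Mercury\n')
notes_file.close()

notes_file = open(notes_path, 'a')   # 'a' adds to the end and keeps what is there
notes_file.write('Venus\n')
notes_file.close()

notes_file = open(notes_path, 'r')   # 'r' only allows reading
notes = notes_file.read()
notes_file.close()

print(notes)
```

Be careful with `'w'`: opening an existing file in this way erases its previous contents.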

When you use `planet_file.readlines()` instead of `planet_file.read()`, you see your planet data with some strange characters interspersed with it. The result is a list of strings, one for each line of the opened file. Printing a list shows each string in its quoted form, including the tab symbols (represented by `\t`) and the "newline" symbols (represented by `\n`), both of which are part of the original file's structure. Here is the example:

In [None]:
planet_file = open('planet_data.txt','r')
data = planet_file.readlines()

print(data)
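
The escape sequences show up whenever Python displays a string in its quoted form, which is what happens to each string inside a printed list. The short sketch below uses a made-up line of text rather than the data file to show the difference.

```python
sample_line = 'Earth\t5.97\n'    # made-up line: a name, a tab, a number, a newline

print(sample_line)          # the tab and newline appear as actual spacing
print([sample_line])        # inside a list, they appear as \t and \n
print(repr(sample_line))    # repr() gives the same quoted form for a single string
```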

If you want to split the output into a list of lines without the additional newline characters (`\n`), you can use `read().splitlines()`:

```python
planet_file = open('planet_data.txt','r')
data = planet_file.read().splitlines()

print(data)
```

In [1]:
# Try the code snippet here

### Closing files
It is good practice to close a file as soon as you are done using it, to avoid accidentally reading from or writing to it later in your program. Each open file also ties up system resources, so a program that opens many files without closing them can eventually be prevented from opening any more.

To close this file, you can use the function `planet_file.close()`.
Close the file in the cell below:

In [None]:
planet_file.close()

### 2.1.1 Intermediate File Reading

Opening, reading and closing a file is typically handled in the following manner:

```python
f = open('planet_data.txt', 'r')
data = f.read().splitlines()
f.close()

print(data)
```

An alternative way to handle this is by using `with` as follows:

```python
with open('planet_data.txt', 'r') as f:
    data = f.read().splitlines()

print(data)
```

In this manner, Python will automatically close the file once it moves outside of the indented portion of your code, even if an error occurs inside that portion.
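
One way to check this behavior is to look at the `closed` attribute of the file object. The sketch below uses a hypothetical scratch file named `with_demo.txt` so that it runs on its own.

```python
import os
import tempfile

# Hypothetical scratch file so the example does not depend on the course data
demo_path = os.path.join(tempfile.mkdtemp(), 'with_demo.txt')
with open(demo_path, 'w') as f:
    f.write('Mars 0.642\n')

with open(demo_path, 'r') as f:
    data = f.read().splitlines()
    print(f.closed)    # False: we are still inside the with block

print(f.closed)        # True: the file was closed on leaving the block
print(data)
```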

## 2.2 Parsing Data

There are multiple ways to break down the content of a file while reading it into Python. Let's think about the structure of our planet data file. 

First off, you may have noticed that the first line of the data file is a comment indicating what the different columns represent - a very good practice! But we want to be able to separate that out when we're reading in our data - we can do this with the startswith function. 

Then, the entries are separated by whitespace, and we can use the split function to break those up. The block of code below shows how this comes together to help us parse the data file. Analyze it, then evaluate the cell and inspect the results. 

In [None]:
planets, masses = [], []   # Notice, we can define multiple things on a single line
planet_file = open('planet_data.txt')
data = planet_file.read().splitlines()

for line in data:
    if not line.startswith('#'):  # Only continue if the line is NOT a comment
        fields = line.split()
        planets.append(fields[0])          # add the planet name to the end of the list
        masses.append(float(fields[1]))    # convert the mass from a string to a number

planet_file.close()
print(planets)
print(masses)


Here's what the above for loop did:

 1. First, we step through the lines in data one at a time, skipping any line that starts with `#`. <br>

 2. Then, we split each remaining line at the white space between its entries (creating two fields, fields[0] and fields[1]): <br>
['Mercury', '0.330']<br>
['Venus', '4.87']<br>
['Earth', '5.97']<br>
['Mars', '0.642']<br>
['Jupiter', '1898']<br>
['Saturn', '568']<br>
['Uranus', '86.8']<br>
['Neptune', '102']<br> 

3. Finally, we append the planet name and the mass from each line to the lists we created at the beginning of the block of code.

Note that there are two columns in planet_data.txt, which are planet names and masses. For each line, the fields list contains both the planet name and the mass. The first element of the fields list is the planet name, fields[0], and the second one is the mass, fields[1]. If we had another column, such as radius, then we would read it from fields[2].
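
As a sketch of the three-column case, the cell below parses a few lines held in a list instead of a file. The radius column is an assumption added for illustration (approximate mean radii in km) and does not exist in planet_data.txt.

```python
# Illustrative lines in the same layout as planet_data.txt, plus a third
# column (approximate mean radius in km) that the real file does not contain
data = [
    '# name  mass(10^24 kg)  radius(km)',
    'Mercury 0.330 2440',
    'Venus   4.87  6052',
    'Earth   5.97  6371',
]

planets, masses, radii = [], [], []
for line in data:
    if not line.startswith('#'):          # skip the comment line
        fields = line.split()
        planets.append(fields[0])
        masses.append(float(fields[1]))
        radii.append(float(fields[2]))    # the third column is fields[2]

print(planets)
print(masses)
print(radii)
```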


### 2.2.1 Advanced Parsing


As we mentioned, data can come in many different formats, using many different kinds of delimiters. A delimiter is a sequence of one or more characters used to specify the boundary between separate entries in a file. 

Luckily, we can use the split function to break up a string at any delimiter we choose by placing that delimiter inside the parentheses, e.g., `split(',')` or `split(':')`. Each call splits at one delimiter, so a string that mixes several delimiters can be handled by applying split more than once. With nothing inside the parentheses, e.g., `split()`, Python splits at any run of whitespace (this is what we did for the planets data). Now, let's look at a more complex string.

In [None]:
challenge = 'Do:you think you:can-parse; this string?'

See if you can break this down to just the word parse in the cell below. We'll get you started.

In [None]:
a = challenge.split()    # you know what this does already! Now a is a list of 5 strings
print('a=',a)
b = a[2].split('-')           # now we've taken the third item and split it off at the dash (-)
print('b=',b)
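
The same idea of applying split more than once works for other formats too. Here is a small sketch that uses the time format from the start of this section, along with two made-up strings (`csv_line` and `stamp`) that are not taken from any of the data files.

```python
time_string = '10:03:20'                 # the time format from the start of this section
hours, minutes, seconds = time_string.split(':')
print(int(hours), int(minutes), float(seconds))

csv_line = 'Earth,5.97'                  # made-up comma-separated entry
print(csv_line.split(','))

# One split call handles one delimiter, so we chain calls to handle a second one
stamp = '2020-01-15 10:03:20'            # made-up date and time
date_part, time_part = stamp.split()     # split at the whitespace first
print(date_part.split('-'))
print(time_part.split(':'))
```

Note that split always returns strings, so numbers have to be converted with `int` or `float` before you can do calculations with them.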

## Practice

Follow the instructions in the cell below to practice reading in and parsing data from a file. The data file, which contains a small snippet of seismic wave data, is named `seismic.txt`. You'll be reading in a few lines of that data, and then extracting the arrival times of the signals, which are in a format that looks like year-month-day"T"hour:minute:seconds"Z". However, let's assume that we only want the arrival times for the waves that have "Pn" seismic phase (but not the ones with phase "P"). 

In [None]:
# Running in Google Colab? Run this cell
!wget https://raw.githubusercontent.com/CIERA-Northwestern/REACHpy/main/Module_3/data/seismic.txt

In [None]:

# seismic.txt contains the data you'll use in this exercise 

# Open and read that data file using read().splitlines()

# PLACE YOUR CODE HERE

# For all lines that contain Pn (not P) in column 7,
#    isolate the time portion of the line (hrs:mins:secs)

# PLACE YOUR CODE HERE

# Save hours, minutes and seconds to three separate lists.
# Hours and minutes should be stored as integers, and seconds as floats.
# Remember to skip over the comment line!

# PLACE YOUR CODE HERE

# Close the file

# PLACE YOUR CODE HERE

# Print your lists to check that you were successful

# PLACE YOUR CODE HERE



## Takeaways

- Real data can come in many different formats, some more complex than others. You must be able to read in and parse your data before you can extract the quantities needed to do your calculations.<br>
- There are many ways to read in files. One of the simplest is with Python's built-in functions for working with file objects, including read, readline, and readlines, which return a string or a list of strings that you can then manipulate.<br>
- Use the split function to break up a string into its individual fields based on the specific delimiter(s) used in the string, e.g., split(), split(':') and split(',').<br>