# File Parsing

Making lists of data points by hand is fine, but what if you have **lots** of data? Wouldn't it be nice to parse large amounts of data from a file instead of typing it all in manually? You can do this with Python's handy file-handling functions!

In [1]:
import matplotlib.pyplot as plt
import numpy as np

We will read Voyager-2 data in this notebook.
[Voyager-2](https://voyager.jpl.nasa.gov) is a spacecraft that was launched in [1977](https://en.wikipedia.org/wiki/Voyager_2). It collects data about the local space environment, including the flux (the rate per unit time and area) of electrons and protons in the nearby environment.

Let's try reading the file.

We're going to use Python's built-in functions to read data from a NASA Voyager data file.

The line below [opens](https://docs.python.org/3/library/functions.html#open) a file called ```VY2PLA_1H_FMT.txt``` in the folder ```infiles```. If you look in the GitHub repository or in the folder containing these notebooks, you should find a folder called ```infiles``` and a file therein called ```VY2PLA_1H_FMT.txt```. The ```"r"``` string at the end tells the computer to open the file for _reading_ rather than for _writing_ (or something else).


In [4]:
outfile = open("infiles/VY2PLA_1H_FMT.txt","r")
print(outfile.name)

infiles/VY2PLA_1H_FMT.txt
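
A file that is opened should eventually be closed. As a minimal sketch (not the approach the cells in this notebook use), Python's `with` statement does the closing for you. The temporary file and the names `example_path` and `infile` are made up so the example runs on its own:

```python
import os
import tempfile

# Create a small throwaway file so this sketch is self-contained;
# in the notebook you would open "infiles/VY2PLA_1H_FMT.txt" instead.
tmp_dir = tempfile.mkdtemp()
example_path = os.path.join(tmp_dir, "example.txt")
with open(example_path, "w") as f:
    f.write("first line\nsecond line\n")

# The with-block closes the file automatically when the block ends.
with open(example_path, "r") as infile:
    print(infile.name)

print(infile.closed)  # True once the block has finished
```

The file object still exists after the block, but it can no longer be read from because it has been closed.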


You might think, "Great work!", but of course, you know that when you open a file in, say, Microsoft Word, and you want to know what's in there, you have to read it. You have to tell the computer to do that too. We will use [readlines()](https://docs.python.org/3/library/io.html#io.IOBase.readlines) to do it here.

The ```readlines``` function returns a list containing every line in the file. Each line is stored as a string.

In [5]:
data = outfile.readlines()

print(type(data))      

<class 'list'>


As the output above confirms, ```readlines``` returned a list. Let's find out how many lines are in this text file.

In [6]:
len(data)

50

Now let's look at the contents of the file!

In [7]:
print(data)



Look carefully at the list above that gets returned by readlines. You'll see that each item in the list is a string that corresponds to a single line of text in the file. Each string ends with ```\n```, the newline character, which tells the computer to go to the next line.
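
If the trailing newline gets in the way, it can be removed from a string. Here is a small sketch on a made-up line (`raw_line` is not read from the Voyager files); `repr` is used so the `\n` is visible when printed:

```python
# A made-up line that mimics one item returned by readlines()
raw_line = "2007   1 18  397.4\n"

print(repr(raw_line))            # repr shows the trailing \n explicitly

# rstrip("\n") returns a new string without the newline at the end
clean_line = raw_line.rstrip("\n")
print(repr(clean_line))
```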

#### <span style="color:blue"> Exercise 4.1 </span>
So that you can read the information a little more easily, write a for loop that prints each line of the file on its own line.

In [None]:
# Write a for loop that prints each line in the file.


### Parsing data

The file we just read, ```VY2PLA_1H_FMT.txt```, describes how the data are formatted in the data files. Armed with that information (go back and review it if you haven't already), we can proceed to extract individual data points from the data files. ```v2_hour_2007.txt``` is one such file.

In [10]:
datafile = open("infiles/v2_hour_2007.txt","r")

data=datafile.readlines()

# let's check how many lines this file has
print(len(data))

1728


Now let's get an idea of what the data looks like. In the code block below, we will print the first four rows (or lines) of data.

In [14]:
# printing the first 4 lines

for x in range(4):
    print(data[x])

2007   1 18  397.4  0.00093   16.5  396.3    2.9  -24.1

2007   1 19  389.1  0.00078    8.7  388.8   -4.9   13.9

2007   1 20  395.9  0.00063    8.0  395.8   -6.4    6.1

2007   2 23  406.4  0.00095   25.2  406.2   10.0   -0.7



In [13]:
# Printing the first 4 lines. 
for line in data[0:4]:
    print(line)

2007   1 18  397.4  0.00093   16.5  396.3    2.9  -24.1

2007   1 19  389.1  0.00078    8.7  388.8   -4.9   13.9

2007   1 20  395.9  0.00063    8.0  395.8   -6.4    6.1

2007   2 23  406.4  0.00095   25.2  406.2   10.0   -0.7



**Side Note** 

The ```0:4``` in the brackets is called a **slice**. The numbers on either side of the colon are indices: the start index and the stop index of the slice, respectively.

A slice selects the items from the start index (inclusive) up to the stop index (exclusive) of a list. Here's an example below:

In [None]:
# named "words" so that we don't overwrite Python's built-in list type
words = ['apple', 'orange', 'pineapple', 'carrot', 'parrot', 'merit']
words[1:4]
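
Either index can also be left out or be negative. The sketch below uses a made-up list called `letters` to show a few common variations:

```python
letters = ['a', 'b', 'c', 'd', 'e', 'f']

first_three = letters[:3]   # start omitted: begin at index 0
from_three = letters[3:]    # stop omitted: run to the end of the list
last_two = letters[-2:]     # negative indices count from the end

print(first_three)
print(from_three)
print(last_two)
```

Slicing always returns a new list and leaves the original unchanged.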

Back to our data! Let's try to extract the date (year, day-of-year, hour). As the formatting documentation describes above, the year is the first column of the data, the day-of-year is the second column, and hour is the third column.

So let's change the loop we made above by printing out ```line[0], line[1], line[2]```. Will this work?

In [None]:
for line in data[0:4]:
    year = line[0]
    dayofyear = line[1]
    hour = line[2]
    print(year, dayofyear, hour)

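It shouldn't. That's because each line is a single string with all the data separated by spaces, so ```line[0]```, ```line[1]```, and ```line[2]``` are just the first three characters of that string. We need to turn each line into a list with one item per column. We can do this by using the **split** function.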

Review the documentation for [split()](https://docs.python.org/3/library/stdtypes.html#str.split). By default the split function divides the line wherever it finds whitespace, but you can split a line based on other delimiters if you want. (Review the docs to figure out how.)
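
As a quick illustration of the difference a delimiter makes, here is a sketch on two made-up strings (`spaced_line` and `comma_line` are not taken from the data files):

```python
spaced_line = "2007   1 18"
comma_line = "2007,1,18"

# With no argument, runs of whitespace count as a single separator
default_items = spaced_line.split()

# With an explicit " ", every single space is a separator,
# so repeated spaces produce empty strings
single_space_items = spaced_line.split(" ")

# Any other delimiter can be passed the same way
comma_items = comma_line.split(",")

print(default_items)
print(single_space_items)
print(comma_items)
```

For columns padded with a varying number of spaces, like the Voyager data, the no-argument form is usually the convenient choice.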

Below we will extract the first line from the data, split the line based on the spaces, and finally print the resulting list.

In [15]:
line = data[0]
print(line)
items = line.split()
print(items)

2007   1 18  397.4  0.00093   16.5  396.3    2.9  -24.1

['2007', '1', '18', '397.4', '0.00093', '16.5', '396.3', '2.9', '-24.1']


What type are the new items in the list? They should be strings, and you can tell because there are ```' '``` quotation marks around each string. 
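
To see why the type matters, here is a small sketch using the first three items of the list printed above. With strings, `+` glues text together instead of adding numbers:

```python
items = ['2007', '1', '18']   # first three entries of the split line

print(type(items[0]))

# "+" concatenates strings rather than adding their numeric values
joined = items[0] + items[1]
print(joined)
```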

#### <span style="color:blue"> Exercise 4.2 </span>
But we want the year, day-of-year, and hour to be floats. Write a function that:
1. accepts a line (string) of data
2. extracts the year, day and hour from the string
3. converts these values into floats and returns them

A function header is provided below to help you get started. We will test the first line of data (```data[0]```).

<details> <summary> <b>Hint!</b> </summary> Refer to the casting section of notebook 1 (Introduction) for a refresher on changing variable types! You might find the <a href="https://docs.python.org/3/library/functions.html#float">float function</a> particularly useful!</details>

In [None]:
## Convert the items to floats
def convert_year_day_hour(line):
    
    #write your code here
    
    return(year, day, hour)

In [None]:
# test your function above by running this cell!
datafile = open("infiles/v2_hour_2007.txt","r")
data=datafile.readlines()

convert_year_day_hour(data[0])

Nice! Now that we have a function that converts a line of data to floats, let's collect all the data so we can use it later. Append the year, day and hour of each line of data to its associated list below.

<details> <summary> <b>Hint!</b> </summary> The <a href="https://docs.python.org/3/tutorial/datastructures.html">append function</a> from notebook 1 (Introduction) will come in handy!</details>

In [None]:
# Now store the data in lists
years = []
days  = []
hours = []

for line in data:
    
    year, day, hour = convert_year_day_hour(line)
    
    # write code here to append the lists
    

#### <span style="color:blue"> Exercise 4.3 </span>
We also want to look at the data recorded by the plasma sensors on Voyager-2. In particular, we're interested in the protons near the Voyager-2 spacecraft.

Go back to the formatting text and figure out which of the columns have the speed, density, and temperature of the protons. Then store the data into three lists.

Note that to get the temperature in units of Kelvin:
$$ T = 0.0052 \times 11604.505  v_{thermal}^2 $$
where $v_{thermal}$ is the proton's thermal speed.
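
The conversion can be wrapped in a small helper. This is only a sketch: the function name `thermal_speed_to_temperature_K` is made up, the input value of 10.0 is arbitrary rather than taken from the data, and the thermal speed is assumed to be in the units used by the data file.

```python
def thermal_speed_to_temperature_K(v_thermal):
    """Convert a proton thermal speed to a temperature in Kelvin
    using T = 0.0052 * 11604.505 * v_thermal**2."""
    return 0.0052 * 11604.505 * v_thermal ** 2

# 10.0 is an arbitrary example value, not a measurement
example_temperature_K = thermal_speed_to_temperature_K(10.0)
print(example_temperature_K)
```

Because the speed is squared, doubling the thermal speed quadruples the temperature.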

In [None]:
# cycle through each line and extract
# the year and day of year, hour,
# proton speed in km/s, proton density in cm^-3
# and proton temperature in K, storing the data in lists

# here are some empty lists for you to use
years = []
days  = []
hours = []
proton_speeds_kms = []
proton_densities_cm3 = []
proton_temperatures_K = []

# write your code below


## Interpreting the Data
Since it's been traveling for more than **40 years**, Voyager-2 may have left the solar system by now. You might wonder why we don't already know whether it has. Isn't the size of the solar system known?

Well, only roughly. Voyager can help us understand how large the solar system is and what shape it has. You can see this by plotting the proton density, speed, and temperature over time.

The Sun pumps protons into the solar system in the form of the solar wind. When Voyager-2 leaves the solar system, it crosses a shock wave. Inside the shock, the protons are moving very fast, driven by the solar wind. Outside, you're in interstellar space, where the particles are moving much slower. You can read the fascinating story of Voyager crossing the shock [here](https://www.nature.com/articles/454038a).

#### <span style="color:blue"> Exercise 4.4 </span>
So let's take a look at what the data are telling us. 

Make three plots:
+ The proton speed vs. the day of year
+ The proton density vs. the day of year.
+ The proton temperature vs. the day of year.

You should see a transition in all three of these plots. Which one is the most striking? Can you find the day of year (and the year) on which Voyager-2 left the solar system?

In [None]:
# make a plot of the proton speed vs. the day of year




In [None]:
# make a plot of the proton density vs. the day of year




In [None]:
# make a plot of the proton temperature vs. the day of year.




## Writing Data to a File

So we did all that work, but have nowhere to save our calculations! If only these lists could be **written** to a file ...

Fortunately, Python has a library for that! Using the csv library, we can write data to a _comma separated values_ file (or CSV for short).

A CSV is like an Excel spreadsheet, only formatted in a way that is easy to parse. The file starts with an (optional) header that lists each column name, followed by the rows of data. The items in each row of a CSV are separated by commas (hence _comma separated_). Check out the example of the school data below to see how the file is translated into a table.

**school_data.csv** :

```
Student Id, Major, GPA
52468, "Physics", 4.0 
32691, "Computer Science", 2.6
26483, "Biology", 3.1
```

| Student Id | Major            | GPA |
|------------|------------------|-----|
| 52468      | Physics          | 4.0 |
| 32691      | Computer Science | 2.6 |
| 26483      | Biology          | 3.1 |
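
Here is a minimal sketch of how Python's built-in `csv` module could turn that text into rows. The data is held in a string (`school_csv`) and wrapped in `io.StringIO` so the example runs without a file on disk; those names are illustrative only.

```python
import csv
import io

# The same school data as above, stored in a string instead of a file
school_csv = '''Student Id, Major, GPA
52468, "Physics", 4.0
32691, "Computer Science", 2.6
26483, "Biology", 3.1
'''

# skipinitialspace=True ignores the space that follows each comma
reader = csv.reader(io.StringIO(school_csv), skipinitialspace=True)
header = next(reader)   # the first row holds the column names
rows = list(reader)     # the remaining rows hold the data

print(header)
for row in rows:
    print(row)
```

Note that every value comes back as a string, just as with `split`, so numeric columns still need to be converted.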

So now that you have an idea of how data is stored in a CSV, let's create one! This can be achieved by writing a for loop that creates a string with comma separated values, but this can be a bit tedious to code. So we will be using the **csv.writer** function to simplify the file creation process.
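
One reason the hand-built approach gets tedious is values that themselves contain a comma. The sketch below compares the two approaches on a made-up row, writing into an in-memory buffer (`io.StringIO`) rather than a real file:

```python
import csv
import io

row = ["Physics, Applied", 4.0]   # made-up row whose first value contains a comma

# Manual approach: join the values yourself
manual_line = ",".join(str(value) for value in row)

# csv.writer approach, writing into an in-memory buffer
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerow(row)
writer_line = buffer.getvalue().rstrip("\n")

print(manual_line)   # the comma inside the first value looks like a column break
print(writer_line)   # csv.writer quotes that value so it stays one column
```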

The **csv.writer** function takes a newly opened file as an argument and creates a csv.writer object (```writer```). To write a row of data through the object we created, we will use the **.writerow** function, which accepts a list of values, one per column.

In [12]:
# now let's write our data to a file
import csv

# newline='' lets the csv module control line endings;
# the with-block closes the file so all rows are saved to disk
with open('data/proton_speeds_kms.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)

    # this writes the header of the CSV file
    writer.writerow(['day', 'proton speed (km/s)'])

    # this stores all the data you collected in the days and proton_speeds_kms lists
    for day, ps in zip(days, proton_speeds_kms):
        writer.writerow([day, ps])

In [14]:
type(writer)

_csv.writer

Let's verify that the table we created is correct. We will be using a library called **pandas** to view the table. This library will be explored more in future exercises.

In [None]:
import pandas as pd

# this will display the CSV you created
pd.read_csv('data/proton_speeds_kms.csv').head()

#### <span style="color:blue"> Exercise 4.5 </span>

Write the data for proton densities and proton temperatures in their own CSV files. Make sure they have proper headers!

In [None]:
# Proton Densities Data
writer = csv.writer(open('data/proton_densities_cm3.csv', 'w'))


# Proton Temperatures Data
writer = csv.writer(open('data/proton_temperatures_K.csv', 'w'))



In [None]:
# verify proton densities CSV
pd.read_csv('data/proton_densities_cm3.csv').head()

In [None]:
# verify proton temperatures CSV
pd.read_csv('data/proton_temperatures_K.csv').head()

## Coming Soon! An Introduction to Pandas