# Section 2.3 | Reading and Parsing Data From a File

So far, you have been working with very short lists of data or generating your own lists of data. However, in data-driven science, you'll likely have to work with large data sets that come in a variety of formats: comma-separated values, tab-separated values, and space-separated values, to name a few. Individual quantities can also come in their own formats depending on what they represent. For example, a list of times might be formatted in hours, minutes, and seconds, like this: 10:03:20. Whatever format your data comes in, you must be able to read it in and separate it into more easily processed components that you can then analyze. That's exactly what parsing data means.

## Reading Data From a File

The first step is to learn how to use Python to directly read data from a file. Let's say we have a data file called citydata.txt containing a few city names and their populations, that looks something like this:

#CityName &nbsp; &nbsp; &nbsp;  &nbsp; Population<br>
Chicago &nbsp; &nbsp; &nbsp; &nbsp;  &nbsp; 2.704958e6<br>
Seattle &nbsp; &nbsp; &nbsp;  &nbsp;  &nbsp; &nbsp; 7.04352e5<br>
Dallas  &nbsp; &nbsp; &nbsp;  &nbsp;  &nbsp; &nbsp;  &nbsp;  1.197816e6<br>
Philadelphia  &nbsp;  &nbsp; 1.567872e6<br>

To open and read this data file, we use the commands:

> cityfile = open('citydata.txt')<br>
> data = cityfile.read()

Here, cityfile is a file object, which is a kind of object that allows you to access and manipulate a file. Once the file object is created (the first line of code), you can use it to reference the file, and access or manipulate it with functions such as read, readline, readlines, write, seek, and close. The second line reads the entire contents of the file into a single string. Notice that the first row is a comment explaining the meanings of the columns, which is helpful to anyone looking at the file. Python does not treat that row specially when reading the file: it is read in like any other line, so your code will need to skip it when you parse the data.
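
The two commands above can be sketched in a self-contained way as follows. This is only an illustration: it assumes a small stand-in file written to a temporary folder (the names `sample_text` and `sample_path` are made up for this sketch), so that it runs even before you have created citydata.txt.

```python
import os
import tempfile

# Stand-in for citydata.txt, written to a temporary folder
# (an assumption made only so this sketch runs on its own).
sample_text = "#CityName Population\nChicago 2.704958e6\nSeattle 7.04352e5\n"
sample_path = os.path.join(tempfile.mkdtemp(), "citydata.txt")

outfile = open(sample_path, "w")
outfile.write(sample_text)
outfile.close()

cityfile = open(sample_path)   # create the file object
data = cityfile.read()         # read the entire contents
cityfile.close()

# The comment line is part of what was read; nothing was skipped.
print(data.startswith("#"))
print(data)
```

Because the comment line comes back with everything else, skipping it is a job for your parsing code, not for open or read.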

### Try it out

Using the JupyterLab text editor, create a new text file named citydata.txt, and copy and paste into it the five lines shown above (the comment line and the four lines of city data). After saving and closing the file, type the commands above in the cell below to first open, and then read, the contents of the data file. Then print out the result of "data". What type of object does it represent?

In [23]:
cityfile = open('citydata.txt')
data = cityfile.read()

print(data)

#CityName         Population
Chicago           2.704958e6
Seattle             7.04352e5
Dallas               1.197816e6
Philadelphia     1.567872e6


When we open the file this way, Python opens it for reading by default. Sometimes you might want to instead write or append to the file (which Python does not let you do by default, and you can probably guess why!). You can specify how you want to access a file when you open it by passing 'r' (read), 'w' (write), or 'a' (append) to the open command, like this: cityfile = open('citydata.txt', 'r'). Be careful with 'w': it erases any existing contents of the file as soon as the file is opened.
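
Here is a rough sketch of the three modes in action. It assumes a scratch file in a temporary folder (the name `scratch_path` is hypothetical), so that your real citydata.txt is never overwritten.

```python
import os
import tempfile

# Hypothetical scratch file, used so the real citydata.txt is left untouched.
scratch_path = os.path.join(tempfile.mkdtemp(), "scratch.txt")

f = open(scratch_path, "w")    # 'w' creates the file, or erases it if it exists
f.write("Chicago 2.704958e6\n")
f.close()

f = open(scratch_path, "a")    # 'a' adds to the end without erasing anything
f.write("Seattle 7.04352e5\n")
f.close()

f = open(scratch_path, "r")    # 'r' is the default mode: reading only
contents = f.read()
f.close()

print(contents)
```

Both lines end up in the file because the second open used 'a'. Had it used 'w' again, only the Seattle line would remain.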

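Now try entering the command data = cityfile.readlines(). What is the result? If you run it right after the read command we used earlier, you get an empty list: read already consumed the whole file, so we are now at the end of the file and there is nothing left to read. Reopen the file (as in the cell below) and try readlines again. Inspect the result, and make sure you understand it.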

You probably see your city data with some strange characters interspersed with it. The earlier read command returned a single string that includes the entire contents of the opened file, whereas readlines returns a list of strings, one per line. In both cases the result keeps the whitespace characters that were part of the original file's structure, such as tab symbols (represented by \t) and the "newline" (enter/return) symbols (represented by \n). This is how Python represents strings that it reads in.

In [24]:
cityfile = open('citydata.txt')
data = cityfile.readlines()

print(data)

['#CityName         Population\n', 'Chicago           2.704958e6\n', 'Seattle             7.04352e5\n', 'Dallas               1.197816e6\n', 'Philadelphia     1.567872e6']
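
The effect of the file position can be sketched as below, again assuming a small stand-in file in a temporary folder (`path` is a made-up name). It uses seek(0), one of the file functions listed earlier, to jump back to the start instead of reopening the file.

```python
import os
import tempfile

# Small stand-in for citydata.txt (an assumption, so the sketch runs anywhere).
path = os.path.join(tempfile.mkdtemp(), "citydata.txt")
outfile = open(path, "w")
outfile.write("#CityName Population\nChicago 2.704958e6\nSeattle 7.04352e5\n")
outfile.close()

cityfile = open(path)
whole = cityfile.read()           # reads everything; we are now at the end
leftover = cityfile.readlines()   # nothing left to read, so this is []
cityfile.seek(0)                  # move back to the start of the file
lines = cityfile.readlines()      # now we get one string per line
cityfile.close()

print(leftover)
print(lines)
```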


Next, try out the command data = cityfile.readline() (note, the readline is singular here). What is the result? 

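Try running that command again. When the file has just been opened, each call to readline returns the next line of the file. Once the end of the file has been reached (as in the cell below, which runs after readlines has already read everything), readline returns an empty string. You should start to see the pattern of how these different file reading functions work.

Feel free to explore the others (e.g., write, seek) on your own. You've now learned one way to read in data from a file (there are many other ways, and we'll learn one or two more in later modules).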
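
As a small sketch of that pattern, the code below calls readline three times on a freshly opened two-line file. The file is a made-up stand-in written to a temporary folder, not the real citydata.txt.

```python
import os
import tempfile

# Two-line stand-in file (hypothetical, created only for this sketch).
path = os.path.join(tempfile.mkdtemp(), "twolines.txt")
outfile = open(path, "w")
outfile.write("#CityName Population\nChicago 2.704958e6\n")
outfile.close()

cityfile = open(path)
first = cityfile.readline()    # the comment line
second = cityfile.readline()   # the Chicago line
third = cityfile.readline()    # an empty string: the file has only two lines
cityfile.close()

# repr shows the newline characters explicitly
print(repr(first))
print(repr(second))
print(repr(third))
```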

In [25]:
data = cityfile.readline()

print(data)




It is often good practice to close a file as soon as you are done using it, to avoid accidentally reading from or writing to it later in your program. To close this file, you can use the function cityfile.close().
Close the file in the cell below:

In [26]:
cityfile.close()
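
An alternative that this section does not cover, shown here only as a sketch, is Python's `with` statement, which closes the file for you when the indented block ends. The stand-in file and the name `path` are assumptions made so the example is self-contained.

```python
import os
import tempfile

# Stand-in data file (hypothetical) so the sketch runs on its own.
path = os.path.join(tempfile.mkdtemp(), "citydata.txt")
with open(path, "w") as outfile:
    outfile.write("#CityName Population\nChicago 2.704958e6\n")

with open(path) as cityfile:
    lines = cityfile.readlines()   # the file is open inside this block

# Outside the block, the file has been closed automatically.
print(cityfile.closed)
print(lines)
```

Calling close() yourself, as in the cell above, does the same job; `with` simply makes it harder to forget.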

In [27]:
cities, populations = [ ], [ ]   # Notice, we can define multiple things on a single line
cityfile = open('citydata.txt')
data = cityfile.readlines()

for line in data:
    if not line.startswith('#'):      # only continue if the line is NOT a comment
        fields = line.split()
        cities = cities + [fields[0]]
        populations = populations + [float(fields[1])]

cityfile.close()
print(cities)
print(populations)

['Chicago', 'Seattle', 'Dallas', 'Philadelphia']
[2704958.0, 704352.0, 1197816.0, 1567872.0]


Here's what the above for loop did:

> 1. The first line (for) steps through data one line at a time, with each line stored as a separate string<br>
> 2. The second line (if) skips any line that starts with #, so the comment line is not parsed<br>
> 3. The third line splits the line at the white space between its fields (creating two fields, fields[0] and fields[1])<br>
> 4. The fourth and fifth lines append the city name and the population (converted to a float) from that line to the previously defined lists (note, we created these empty lists at the beginning of the block of code)
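
To see the loop body at work, it can help to trace it by hand on a single line. The sketch below does that for the Chicago line, typed in directly rather than read from the file. It uses append, which here has the same effect as the `cities = cities + [...]` form in the cell above.

```python
cities, populations = [], []

# One line as it would come back from readlines()
line = "Chicago           2.704958e6\n"

if not line.startswith("#"):            # skip comment lines
    fields = line.split()               # split at whitespace; the \n is dropped too
    cities.append(fields[0])            # the city name stays a string
    populations.append(float(fields[1]))  # the population text becomes a number

print(fields)
print(cities)
print(populations)
```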


## Advanced Parsing

The city data example was a pretty easy example to parse. But your data files will not always be that easy.

As we mentioned, data can come in many different formats, using many different kinds of delimiters. A delimiter is a sequence of one or more characters used to specify the boundary between separate entries in a file. 

Luckily, we can use the split function to split up strings at any given delimiter by putting the delimiter inside the parentheses, e.g., split(','), split(':'). The delimiter can be more than one character long (e.g., split(', ')), but each call to split uses only one delimiter, so a string that mixes several different delimiters has to be split in several steps. As you can guess, with nothing inside the parentheses, Python splits at whitespace. Now, let's look at a more complex string.

In [28]:
challenge = 'Do:you think you:can-parse; this string?'

See if you can break this down to just the word parse in the cell below. We'll get you started.

In [29]:
a = challenge.split()    # you know what this does already! Now a is a list of 5 strings
b = a[2].split('-')      # now we've taken the third item and split it at the dash (-)
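
The same idea of splitting in several steps can be sketched on different strings, so the challenge above is still yours to finish. The time string comes from the start of this section; the `entry` string is made up for illustration.

```python
# One delimiter: split a time at the colons, then convert the pieces
time_string = "10:03:20"
parts = time_string.split(":")      # a list of three strings
hrs = int(parts[0])
mins = int(parts[1])
secs = float(parts[2])

# Two different delimiters: split in two steps
entry = "Chicago,IL;2.704958e6"     # made-up example string
pieces = entry.split(";")           # first split at the semicolon
name_parts = pieces[0].split(",")   # then split the first piece at the comma

print(parts, hrs, mins, secs)
print(pieces)
print(name_parts)
```

Each call to split returns a list, so you pick out one item with an index before splitting again.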


## Practice

Follow the instructions in the cell below to practice reading in and parsing data from a file. The data file, which contains a small snippet of seismic wave data, is called 'data.txt'. You'll be reading in a few lines of that data, and then extracting the arrival times of the signals, which are in a format that looks like year-month-day"T"hour:minute:seconds"Z". However, let's assume that we only want the arrival times for the waves that have the "Pn" seismic phase (but not the ones with phase "P").

In [None]:

# data.txt contains the data you'll use in this exercise 
# (you can read it in in the usual way)

# Open and read that data file using readlines()

FILL IN CODE



# For all lines that contain Pn (not P) in column 7,
#    isolate the time portion of the line (hrs:mins:secs)
# Then save hrs, mins and secs values to three different lists with those names
# hrs and mins should be lists of integers, secs a list of floats
# Tip: Make sure you skip over the comment line!

FILL IN CODE



# Close the file
FILL IN CODE

# Make sure you compare to the data to check that you've done it correctly!





## Takeaways

> - Real data can come in many different formats, some more complex than others. You must be able to read in and parse your data before you can extract the quantities needed to do your calculations<br>
> - There are many ways to read in files. One of the simplest is with Python's built-in functions for working with file objects, including read, readline, readlines, which return a string or a list of strings that you can then manipulate<br>
> - Use the split function to break up a string into its individual fields based on the specific delimiter(s) used in the string, e.g., split(), split(':') and split(',').<br>