# Example: Extracting Time Information from CSV Files

Let us now look at a typical example of transforming and extracting
information from data that is given to us in a particular format.
We shall examine various coding options and potential problems
in some detail. We shall mainly focus on low-level manipulation of
time periods represented by strings and numbers. But at the end of the
exercise you will be given a task to reimplement the functionality
using the much higher level `datetime.timedelta` object provided by Python.

We shall consider files containing the details of music playlists
structured in the form of a CSV representation.
Note that this simple playlist format has been devised specifically
for teaching purposes. Actual music software uses other formats, the
most popular being [M3U](https://en.wikipedia.org/wiki/M3U), which is
also quite simple but is not based on the CSV format.

This notebook uses the files `music_for_programmers.csv`, `snake-music.csv` and `big_song_track_dataset.csv`, which will be in the same directory as this notebook on Coursera Labs. If you are working on your own computer, make sure you download these files and save them in the same directory as this notebook.

We can display the contents of the first of these files using the following code:

In [None]:
with open('music_for_programmers.csv') as f:
    print( f.read())

Although this file format is very simple there are many
useful things we could do with this kind of data.
Also, although the meaning is fairly obvious, there are various
issues that could arise and may complicate the interpretation of
the data.

Have a look at the example above and think about what issues
might occur.

<p>
<details>
    <summary>
        <b>Click here to see some possible issues.</b>
    </summary>
    <ul>
        <li>We need to be aware that the first line of the CSV is not
            actual data about music tracks but is rather <i>metadata</i>,
          which means that it tells us what the data is. Specifically,
          in the case of CSV files, the first line usually consists of
          column titles which indicate the nature of the data in each
          column. However, we also need to be aware that CSV files do
            not always contain headers.</li>
        <li>There is a space before the album title MAYA. The space is not
          actually part of the album title but could possibly be included
          in the string representation of the album title, when we read
            it from this file.</li>
        <li>There is actually another space right at the end after the '18:03'
            which could be very difficult to spot.</li>
        <li>There could be some issues with the time representation. We will
            look at this below.</li>
        <li>In the given playlist none of the textual data includes commas. But
          commas might be present in the track and album names or artists of
          other tracks. In such cases we would need some way of distinguishing
          them from the commas used to separate the fields. The standard way
            is to enclose text fields in quotation marks.</li>
    </ul>
</details>
</p>
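
The last two issues can be explored with a small experiment. The following is a minimal sketch using an invented one-line CSV (not taken from the playlist files): it shows how `csv.reader` treats a quoted field containing a comma, and how the reader's `skipinitialspace` option drops spaces that directly follow a separator.

```python
import csv
import io

# An invented CSV line: the first field contains a comma, so it is quoted,
# and there is a space after each separating comma.
sample = '"Rain, Wind and Fire", Some Artist, 4:05\n'

plain = next(csv.reader(io.StringIO(sample)))
trimmed = next(csv.reader(io.StringIO(sample), skipinitialspace=True))

print(plain)    # leading spaces are kept in the second and third fields
print(trimmed)  # leading spaces after the separators are removed
```

Note that `skipinitialspace` only affects spaces immediately after a separator, so a trailing space at the end of a field would still be present.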

Let us now consider what kinds of uses we might make of a 
playlist CSV file (or a collection of these files) and how we might 
implement these. You may be surprised by the large number
of uses that these quite simple files could have.

Some uses and functionalities that could be provided by
playlist files:

* Display a playlist (or set of playlists) in an attractive and 
  informative way, so a user can browse the tracks and perhaps
  select a track they want to play.

* Find the total playing time of the playlist, so that a user
  can create a list to fill a certain time interval.

* Enable users to create new playlists, perhaps by selecting from
  existing playlists.
  
* Link playlists to actual music files (e.g. `MP3` files), in order 
  that tracks can be played. (Perhaps also warn if there are tracks
  in the playlist for which music files cannot be found.)
  
* Generate HTML search query links from the playlist data.

* Find shared tracks or other connections between different playlists.

* Find correlations between a person's playlist selections and other
  data about the person, such as age, gender, nationality, items they
  have bought etc.
  
* Use playlists from different people to suggest new music that someone may like.
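
As an illustration of one of these uses, here is a minimal sketch of generating an HTML search link from a track title and artist. The function name `make_search_link` and the search address `https://www.example.com/search?q=` are hypothetical placeholders; a real application would substitute the address of an actual search service.

```python
import html
from urllib.parse import quote_plus

# Hypothetical search address, used only for illustration.
SEARCH_URL = "https://www.example.com/search?q="

def make_search_link(title, artist):
    """Return an HTML link that searches for the given title and artist."""
    query = quote_plus(f"{title} {artist}")   # make the text safe for a URL
    label = html.escape(title)                # make the text safe for HTML
    return f'<a href="{SEARCH_URL}{query}">{label}</a>'

print(make_search_link("Rain & Wind", "Some Artist"))
```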

### Extracting Data from a Playlist File

We can quite easily extract data from a CSV file in the form of a _list of lists_. I call this a _datalist_.
The following code defines a function
`get_datalist_from_csv`, which reads a CSV file and
returns a datalist. In this function I am using a `csv.reader` object from the `csv` module, which is imported. This makes things a bit easier than reading the raw text of the file (which we could also do).

In [None]:
## Import module for handling csv files.
import csv

def get_datalist_from_csv( filename ):
    ## Create a 'file object' f, for accessing the file.
    ## (The csv module documentation recommends opening with newline=''.)
    with open( filename, newline='' ) as f:
        reader = csv.reader(f)     # create a 'csv reader' from the file object
        datalist = list( reader )  # create a list from the reader
    return datalist

We can test this function with the following code. Here we are using a global variable, `DATA`, to store the contents of the example `csv` file in _datalist_
format and then simply printing out each element of `DATA`, which is a list
of data items:

In [None]:
DATA = get_datalist_from_csv("music_for_programmers.csv")
for row in DATA: 
    print(row)

Representing each row of the playlist as a list of strings makes it a bit easier to see the extra spaces in the data.
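
If the extra spaces are unwanted, one option is to trim every field after reading. The following is a minimal sketch; the helper name `strip_fields` is an assumption introduced here, and the sample row is invented to imitate the kind of stray spaces seen above.

```python
def strip_fields(datalist):
    """Return a copy of a datalist with whitespace trimmed from every field."""
    return [[field.strip() for field in row] for row in datalist]

# Invented row with a leading space and a trailing space.
messy = [["Track A", "Artist A", " Album A", "18:03 "]]
print(strip_fields(messy))
```

This builds a new datalist rather than modifying the original, so the raw data remains available for inspection.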

### Finding the total playing time

A quite simple but useful data extraction task we can do on the playlist
is to extract the total playing time. This will illustrate some basic
data manipulation operations.

One thing to note is that the first item is not a data record but is
a header specification that contains strings that describe the content
of each field of the following data records. We can easily strip this off with the following code:

In [None]:
DATA = DATA[1:]

It is now straightforward to extract a list of just the strings representing the playing time from each of the records in the datalist:

In [None]:
time_strings = [record[3] for record in DATA]
time_strings

Of course we cannot simply add up these times, since they are represented as character strings rather than numbers. Also, they are written in the form `'m:s'`, where `m` is a number of minutes and `s` is a number of seconds.

### Converting a time string representation to a number of seconds
Given that converting these time strings to a numerical form, such as a number of seconds, is a meaningful component of the processing we want to do, it is a good idea to define this operation as a function. We could define this as follows:

In [None]:
def time_string_to_seconds(s):
    m_s = s.split(':')
    mins = int(m_s[0]) 
    secs = int(m_s[1])
    return mins*60 + secs

Then we can test this on the `time_strings` data that was extracted from the playlist `.csv` file.

In [None]:
times = [time_string_to_seconds(s) for s in time_strings]
times

Note that in carrying out this transformation, we were lucky that the string `'18:03 '` was processed correctly. This is because Python's type conversion operation
`int('03 ')` ignores the extra space (leading and trailing whitespace is allowed). Data quality issues are often difficult to
detect and may or may not cause problems, depending on the kind of operations we
apply.
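
The tolerance of `int` for surrounding whitespace can be checked directly. The short sketch below (variable names are chosen only for illustration) also shows that whitespace in the middle of a number is not accepted.

```python
trailing = int("03 ")   # trailing space is ignored
leading = int(" 18")    # leading space is ignored

try:
    int("1 8")          # a space inside the number raises ValueError
    inner_ok = True
except ValueError:
    inner_ok = False

print(trailing, leading, inner_ok)
```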

Suppose that we had implemented the time string conversion function
as follows:

In [None]:
def alt_time_string_to_seconds(s):
    mins = int(s[0])
    secs = int(s[-2:])
    return mins*60 + secs

## Test:
times = [alt_time_string_to_seconds(s) for s in time_strings]
times

This may appear at first sight to have worked as no error has occurred. However, if we check the values, we see that the value for the time of the final track is incorrect. In fact, as well as the problem with the data (the time string having an extra space), there is also a problem with the code.


<p>
<details>
    <summary>
        <b>What is the bug in the code?</b> (Click to reveal)
    </summary>
    <ul>
        <li>In this second definition, <code>alt_time_string_to_seconds</code>, the programmer has assumed
    that the minutes part of the time string always consists of one digit, whereas it could consist of two digits.
        </li>
    </ul>
</details>
</p>


### A better solution
Having seen how different forms of time string could occur, depending on the length
of a track, we realise that we really need a more general function that can cope
with time strings that consist of only a number of seconds, or minutes and seconds, or hours, minutes and seconds. (I guess we can assume we won't have tracks
lasting more than a day. If we did, they could still be represented in terms of
hours, minutes and seconds.)
Here is a new function definition that can take care of all these possible
formats.

In [None]:
def time_string_to_seconds(s):
    parts = s.split(':')
    parts = [int(p) for p in parts]
    if len(parts) == 1:
        return parts[0]
    if len(parts) == 2:
        return parts[0]*60 + parts[1]
    if len(parts) == 3:
        return parts[0]*60*60 + parts[1]*60 + parts[2]
    print( "!!! ERROR: time_string_to_seconds: incorrect format:", s)
    return False

### Testing the function
To be sure our function is working as intended, it is an extremely
good idea to test it on a variety of different possible inputs.
In fact the following test tries all the allowed formats, and
also one that is not allowed:

In [None]:
for s in ["5", "12", "3:1",    "3:12",  "10:12", 
          "1:3:1",   "1:3:12", "1:10:12", "10:1:1", 
          "10:1:12", "10:12:12", "9:9:9:9"]:
    print(time_string_to_seconds(s))
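
One point to be aware of is that the function above returns `False` for a badly formatted string, and Python treats `False` as `0` in arithmetic, so a malformed time would silently contribute nothing to a sum. As an alternative design, here is a sketch of a stricter variant that raises a `ValueError` instead. The name `strict_time_string_to_seconds` is hypothetical and this is only one possible approach.

```python
def strict_time_string_to_seconds(s):
    """Hypothetical stricter variant: raise ValueError on a bad format."""
    parts = [int(p) for p in s.split(':')]
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"incorrect time format: {s!r}")
    seconds = 0
    for part in parts:              # treat the parts as base-60 digits
        seconds = seconds * 60 + part
    return seconds

print(strict_time_string_to_seconds("1:3:12"))
try:
    strict_time_string_to_seconds("9:9:9:9")
except ValueError as err:
    print("Rejected:", err)
```

With this version, code that processes a whole playlist stops at the first bad time string unless the caller explicitly handles the exception.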

### Finding the total playing time

The time strings have now been converted to a simple and uniform 
numerical form (the number of seconds); so the total playing time (in seconds) 
is just the sum of these values:

In [None]:
times = [time_string_to_seconds(s) for s in time_strings]
total_time = sum(times)
total_time

### Putting it together in a function
Now that we have successfully implemented the stages of processing
required to extract the total playing time of a playlist, it is a good
idea to package up the required processing steps into a function.
This will enable us to use the function to directly get the length
of a playlist from its CSV file without having to remember and recode
all the steps.
The following function definition does this job.
Notice that we make use of the previously defined functions
`get_datalist_from_csv` and `time_string_to_seconds` within
our new function `get_playlist_length`. This illustrates how the
use of small functions with a clear meaning and purpose can help
us to build more complex functions.

In [None]:
def get_playlist_length(filename):
    data = get_datalist_from_csv(filename)
    tracks = data[1:]
    time_strings = [record[3] for record in tracks]
    playtimes = [time_string_to_seconds(ts) for ts in time_strings]
    return sum(playtimes)

Let us test `get_playlist_length` on our two example files. The following code
returns a tuple containing the values from each of the files.

In [None]:
( get_playlist_length("music_for_programmers.csv"),
  get_playlist_length("snake-music.csv")
)

### Converting seconds back to hours, minutes and seconds

A time length given in seconds is not so easy for humans to interpret. Hence, we may want to convert it into hours, minutes and seconds to display the total time.

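We shall take this opportunity to advocate a very powerful method that top-grade
programmers often adopt. When they need to solve a specific
problem, they consider whether it is a special case of a more general problem.
If so, it may be easier to solve the more general problem. This has the added
advantage that they can potentially use the solution again if a similar
type of coding problem arises.

In this case, converting seconds into hours, minutes and seconds is a special case
of converting a number into a _mixed-base_ representation, in which each unit may be
a different multiple of the next smaller unit. The following function takes a number `n`
(a quantity in the smallest unit) and a list `bases` of conversion factors, given from
the largest unit to the smallest: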

In [None]:
def convert_to_mixed_base( n, bases ):
    quantities = []            # Build up a list of quantities, one per base
    for f in bases[::-1]:      # go through the bases in reverse order
        quantities.append(n%f) # add remainder from smallest base to the list
        n = n//f               # divide number by the base (// ignores remainder)
    quantities.append(n)       # add what is left (the quantity of the highest base)
    quantities.reverse()       # reverse to give highest base quantity first
    return quantities

We can test this by, for example, considering the Imperial system of weights, where there are 14 _pounds_ in a _stone_, and 16 _ounces_ in a _pound_.
Hence to convert 1000 ounces to stones, pounds and ounces we could use the
following function call:

In [None]:
convert_to_mixed_base( 1000, [14,16])

This output tells us that 1000 ounces is equal to 4 stone, 6 pounds and 8 ounces. (You can see why many people prefer the metric system. Why do we not use a base 10 system for time measurements?)
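
A useful way to check a conversion like this is to convert the result back again. The following sketch defines a hypothetical inverse function, `from_mixed_base` (a name introduced here for illustration), which takes the list of quantities and the same list of bases and rebuilds the original number.

```python
def from_mixed_base(quantities, bases):
    """Hypothetical inverse of a mixed-base conversion.

    quantities has one more element than bases, highest unit first.
    """
    n = quantities[0]
    for quantity, base in zip(quantities[1:], bases):
        n = n * base + quantity     # move down to the next smaller unit
    return n

print(from_mixed_base([4, 6, 8], [14, 16]))   # stones, pounds, ounces
```

Here 4 stone is 56 pounds, adding 6 gives 62 pounds, which is 992 ounces, and adding 8 gives back 1000.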

We can now easily write a function for turning a number of seconds back to a time representation string of the sort used in the playlist files:

In [None]:
def seconds_to_time_string(s):
    h_m_s = convert_to_mixed_base( s, [60,60] )
    h_m_s_strings = [str(x) for x in h_m_s]
    return ":".join(h_m_s_strings)

In [None]:
seconds_to_time_string(10000)

When programming, it is usually best to double check that all is as expected, so let us try converting that back to seconds with a simple calculation:

In [None]:
2*60*60 + 46*60 + 40 

Great. That seems to work. (Though when programming a real application testing just one case is not enough.)
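
One detail worth noticing is that `seconds_to_time_string` does not pad the parts with zeros, so 3605 seconds would come out as `1:0:5` rather than `1:00:05`. If the more conventional padded form is wanted, a possible variant is sketched below. The name `seconds_to_padded_time_string` is hypothetical, and it uses Python's built-in `divmod` rather than the general mixed-base function.

```python
def seconds_to_padded_time_string(total_seconds):
    """Format a number of seconds as h:mm:ss with zero-padded parts."""
    minutes, seconds = divmod(total_seconds, 60)   # split off the seconds
    hours, minutes = divmod(minutes, 60)           # split off the minutes
    return f"{hours}:{minutes:02d}:{seconds:02d}"

print(seconds_to_padded_time_string(3605))
print(seconds_to_padded_time_string(10000))
```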

###  Playing times and `timedelta` objects

We have seen how we can manipulate time information by low-level operations on string
representations. This is a good illustration of the kinds of data conversion one 
may need to implement. However, since handling time information is a common 
requirement of many programming tasks, Python actually provides standard modules
that support much easier ways of working with time information. In particular
the module `datetime` enables one to create objects corresponding to time points,
dates, time periods and even time-zones; and these objects support a variety
of useful operations that one may want to perform with temporal information.

[Python3 datetime module documentation](https://docs.python.org/3/library/datetime.html)


Using the `datetime` module, it is easy to create `timedelta` objects, which store
time period measurements in a very flexible way. This is illustrated by the
following code example:

In [None]:
import datetime

dt1 = datetime.timedelta( hours=1, minutes=20)
dt2 = datetime.timedelta( minutes=90)
dt3 = datetime.timedelta( minutes=5, seconds=5)

print( "dt1 < dt2 ?", dt1 < dt2 )

print( "dt1 < dt3 ?", dt1 < dt3 )

dt4 = dt1 + dt2 + dt3

dt4
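
Two further features of `timedelta` objects are often convenient: the `total_seconds()` method gives the length of the period as a number of seconds, and converting a `timedelta` to a string gives a readable `h:mm:ss` form. The following short sketch illustrates both (the variable name `period` is chosen only for illustration).

```python
import datetime

# 1 hour 20 minutes plus 90 minutes is 170 minutes in total.
period = datetime.timedelta(hours=1, minutes=20) + datetime.timedelta(minutes=90)

print(period.total_seconds())   # the period as a float number of seconds
print(str(period))              # the period in h:mm:ss form
```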

### A word of caution

Working with data that involves dates and times can often be difficult, in part because of the different ways that times and dates can be represented (e.g. the order in which days and months are displayed) and because times and dates are different in different places. The source of a lot of problems with dates and times is the fact that there is not a whole number of days in a year and the length of the year changes slightly each year. So if you ever have to work with data that involves dates and times, take extra care, consider possible issues with the data (such as whether it is formatted in a consistent way) and check everything you do very carefully.
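
As a small illustration of the representation problem, the sketch below parses the same invented date string under two different assumptions about the order of day and month. Both interpretations are valid, and nothing in the string itself tells us which one was intended.

```python
import datetime

text = "03/04/2021"   # invented example: 3 April or 4 March?

day_first = datetime.datetime.strptime(text, "%d/%m/%Y").date()
month_first = datetime.datetime.strptime(text, "%m/%d/%Y").date()

print(day_first)     # interpreted as day/month/year
print(month_first)   # interpreted as month/day/year
```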

# Exercise Task

Now you have seen some coding examples of basic data extraction in 
relation to the CSV-based playlist format, you should try the following
exercises:

* Reimplement the calculation of total playing time and its output
  in human-readable format using `datetime.timedelta` objects.
  
* Examine the file [big_song_track_dataset.csv](big_song_track_dataset.csv).
  This contains similar data to what is in our playlist format but also has
  a lot of additional information.
  Write a function that can read data from this file and extract 
  a very large playlist datalist from it in the format used above.
  (Do not include information that is not required for our playlist format
  and convert the duration information into a form that can be used by
  your function to calculate the total duration of a playlist.)
  
* Write a function that given a time duration as an input parameter
  can generate a playlist of tracks picked randomly from
  big_song_track_dataset.csv, such that the total playtime is 
  approximately the same as the time duration parameter.
  (You should consider various options for how to fulfil the requirement of the
  duration being approximately the same. Imagine how this function
  might be used. Should it sometimes give a playlist that is slightly
  longer or slightly shorter? Is it possible to fill time periods
  exactly, or optimally? Perhaps it should have different options
  that can be given as an additional parameter.)

