# Lesson I

## Reading date and time data in Pandas

In this chapter, you will use the Pandas library to work with dates and times. You should have encountered Pandas before, but now we will add datetimes to the mix.

### A simple Pandas example

To start with, let's load data with Pandas. First, we import ``pandas``, using the customary alias ``pd``. 
Our data is in a CSV file, so we load it with the ``read_csv()`` function. ``pd.read_csv()`` has one required argument, the name of the file to load, which in this case is ``capital-onebike.csv``. We save the result to the variable ``rides``. Let's print the first three rows to see what we've got.

In [2]:
# Load Pandas
import pandas as pd
# Import W20529's rides in Q4 2017
rides = pd.read_csv('datasets/capital-onebike.csv')
# See our data
rides.head(3)

| | Start date | End date | Start station number | Start station | End station number | End station | Bike number | Member type |
|---|---|---|---|---|---|---|---|---|
| 0 | 2017-10-01 15:23:25 | 2017-10-01 15:26:26 | 31038 | Glebe Rd & 11th St N | 31036 | George Mason Dr & Wilson Blvd | W20529 | Member |
| 1 | 2017-10-01 15:42:57 | 2017-10-01 17:49:59 | 31036 | George Mason Dr & Wilson Blvd | 31036 | George Mason Dr & Wilson Blvd | W20529 | Casual |
| 2 | 2017-10-02 06:37:10 | 2017-10-02 06:42:53 | 31036 | George Mason Dr & Wilson Blvd | 31037 | Ballston Metro / N Stuart & 9th St N | W20529 | Member |


We can also select a particular column with brackets, as here where we call ``rides['Start date']``. And we can get a particular row with ``.iloc[]``, in this case row number 2. Because we didn't tell Pandas to treat the start date and end date columns as datetimes, they are loaded as plain strings, which Pandas reports as dtype ``object``. We want them to be datetimes so we can work with them effectively, using the tools from the first three chapters of this course.

In [3]:
rides['Start date']

0      2017-10-01 15:23:25
1      2017-10-01 15:42:57
2      2017-10-02 06:37:10
3      2017-10-02 08:56:45
4      2017-10-02 18:23:48
              ...         
285    2017-12-29 14:32:55
286    2017-12-29 15:08:26
287    2017-12-29 20:33:34
288    2017-12-30 13:51:03
289    2017-12-30 15:09:03
Name: Start date, Length: 290, dtype: object

In [4]:
rides.iloc[2]

Start date                               2017-10-02 06:37:10
End date                                 2017-10-02 06:42:53
Start station number                                   31036
Start station                  George Mason Dr & Wilson Blvd
End station number                                     31037
End station             Ballston Metro / N Stuart & 9th St N
Bike number                                           W20529
Member type                                           Member
Name: 2, dtype: object
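
To make the "simply strings" point concrete, here is a minimal sketch that does not need the real file. It assumes a tiny in-memory CSV (the `csv_text` and `sample` names are made up for illustration) built from two rows of the output above.

```python
import io
import pandas as pd

# Two sample rows copied from the output above; a stand-in for the real CSV file.
csv_text = """Start date,End date,Bike number
2017-10-01 15:23:25,2017-10-01 15:26:26,W20529
2017-10-01 15:42:57,2017-10-01 17:49:59,W20529
"""

# Load without parse_dates, as in the simple example.
sample = pd.read_csv(io.StringIO(csv_text))

# Each value in the column is an ordinary Python string, not a datetime.
first_start = sample['Start date'].iloc[0]
print(type(first_start))
print(first_start)
```

Since the values are text, date operations such as extracting the hour or computing a duration are not available until the column is converted.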

### Loading datetimes with parse_dates

If we want Pandas to treat these columns as datetimes, we can pass the ``parse_dates`` argument to ``pd.read_csv()`` and set it to a list of column names, given as strings. 
Pandas will then read these columns and convert them to datetimes for us. 

Pandas will try to infer the format of your datetime strings. In the rare case that this doesn't work, you can use the ``pd.to_datetime()`` function, which lets you specify the format manually. For more details, see the Pandas documentation.

In [5]:
# Import W20529's rides in Q4 2017, parsing both date columns on load
rides = pd.read_csv('datasets/capital-onebike.csv',
                    parse_dates=['Start date', 'End date'])

# Or: convert a column after loading, giving the format explicitly
rides['Start date'] = pd.to_datetime(rides['Start date'],
                                     format="%Y-%m-%d %H:%M:%S")
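
An explicit format matters most when the strings are ambiguous. The sketch below assumes some hypothetical day-first strings (they are not from the bike data), where `01/10/2017` means 1 October rather than 10 January.

```python
import pandas as pd

# Hypothetical day-first date strings, made up for illustration.
raw = pd.Series(['01/10/2017 15:23', '02/10/2017 06:37'])

# Tell Pandas exactly how to read them: day/month/year hour:minute.
parsed = pd.to_datetime(raw, format='%d/%m/%Y %H:%M')

print(parsed.iloc[0])  # 2017-10-01 15:23:00
```

The format codes are the same ones used by Python's `strftime()` and `strptime()`.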


Now when we again ask for the Start date for row 2, we get back a *Pandas Timestamp*. ``Timestamp`` is a subclass of Python's ``datetime``, so for nearly all purposes it behaves like a datetime object under a different name.

In [6]:
# Select Start date for row 2
rides['Start date'].iloc[2]

Timestamp('2017-10-02 06:37:10')
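
As a quick check of how closely a Timestamp tracks a Python datetime, here is a small sketch. It builds the same value by hand instead of reading it from `rides`, and the name `ts` is just for illustration.

```python
from datetime import datetime
import pandas as pd

# The same value as row 2 above, constructed directly.
ts = pd.Timestamp('2017-10-02 06:37:10')

# Timestamp is a subclass of datetime, so the usual attributes work.
print(isinstance(ts, datetime))     # True
print(ts.year, ts.hour)             # 2017 6

# It also mixes with plain datetime objects in arithmetic.
print(ts - datetime(2017, 10, 2))   # 0 days 06:37:10
```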

### Datetime arithmetic

Since our ``Start date`` and ``End date`` columns are now ``datetimes``, we can deal with them the way we usually deal with datetimes. 
For example, we can create a new column, ``Duration``, by subtracting ``Start date`` from ``End date``. 
Because both columns are datetimes, subtracting them gives us ``timedeltas``. 

If we print out the first 5 rows, we see that the first ride lasted only 3 minutes and 1 second, the second ride lasted 2 hours and 7 minutes, the third ride lasted 5 minutes and 43 seconds, and so on.

In [7]:
# Create a duration column
rides['Duration'] = rides['End date'] - rides['Start date']
# Print the first 5 rows
rides['Duration'].head()    

0   0 days 00:03:01
1   0 days 02:07:02
2   0 days 00:05:43
3   0 days 00:21:18
4   0 days 00:21:17
Name: Duration, dtype: timedelta64[ns]
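
Once durations are timedeltas, they can be summed and compared like other numeric columns. The sketch below rebuilds the first three rides by hand so that it runs without the CSV file. The variable names and the 30-minute threshold are assumptions chosen for illustration.

```python
import pandas as pd

# Start and end times of the first three rides, copied from the output above.
start = pd.to_datetime(pd.Series(['2017-10-01 15:23:25',
                                  '2017-10-01 15:42:57',
                                  '2017-10-02 06:37:10']))
end = pd.to_datetime(pd.Series(['2017-10-01 15:26:26',
                                '2017-10-01 17:49:59',
                                '2017-10-02 06:42:53']))

# Subtracting two datetime columns gives a timedelta column.
duration = end - start

# Timedeltas can be aggregated...
total = duration.sum()
print(total)  # 0 days 02:15:46

# ...and compared against a fixed Timedelta to flag long rides.
long_rides = duration > pd.Timedelta(minutes=30)
print(long_rides.tolist())  # [False, True, False]
```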

### Method chaining and the .dt accessor

Pandas has two features worth noting here. Let's see an example of converting our ``Duration`` to *seconds* and looking at the first 5 rows. 

First, Pandas code is often written in a *"method chaining"* style, where we call one method on the result of another, and then another on that. 

For readability, it's common to put each call on its own line, ending every line except the last with a backslash (``\``). 
Second, you can access the typical datetime and timedelta methods on a whole column through the ``.dt`` accessor. For example, we can convert our timedeltas into numbers with ``.dt.total_seconds()``. 

Now when we look at the results, we see that we've got seconds instead of timedeltas. Our first ride lasted 181 seconds, our second ride 7622 seconds, and so on.

In [8]:
rides['Duration']\
    .dt.total_seconds()\
    .head()

0     181.0
1    7622.0
2     343.0
3    1278.0
4    1277.0
Name: Duration, dtype: float64
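
Backslashes are one way to break up a chain. Another common option, shown here as an alternative and not as the style this lesson uses, is to wrap the whole chain in parentheses. The sketch rebuilds the first three durations by hand so that it runs on its own.

```python
import pandas as pd

# The first three ride durations from above, built directly as timedeltas.
duration = pd.Series(pd.to_timedelta(['0 days 00:03:01',
                                      '0 days 02:07:02',
                                      '0 days 00:05:43']))

# Parentheses let the chain span several lines without backslashes.
seconds = (
    duration
    .dt.total_seconds()
    .head()
)

print(seconds.tolist())  # [181.0, 7622.0, 343.0]
```

With parentheses, a stray space after a line break cannot break the code, which is a common pitfall with trailing backslashes.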