# Table of Contents
* [Learning Objectives:](#Learning-Objectives:)
* [Pandas Exercise 1: Handling Messy Data](#Pandas-Exercise-1:-Handling-Messy-Data)
	* [Background](#Background)
	* [Set-up](#Set-up)
	* [Part 1: Read Format](#Part-1:-Read-Format)
	* [Part 2: Time Format](#Part-2:-Time-Format)
	* [Part 3: Visualization](#Part-3:-Visualization)
	* [Part 4: Optional: Simple Statistics to help visualization](#Part-4:-Optional:-Simple-Statistics-to-help-visualization)


# Learning Objectives:

After completion of this module, learners should be able to:

* read a CSV file containing uncommon text formatting
* parse dates into a ``Timestamp``-based index when reading a CSV file
* plot DataFrame data with matplotlib
* use simple statistics to help interpret visualization

# Pandas Exercise 1: Handling Messy Data

## Background

This exercise is adapted from an example by [Julia Evans](http://jvns.ca). The original task is laid out in [Chapter 1](http://nbviewer.ipython.org/github/jvns/pandas-cookbook/blob/master/cookbook/Chapter%201%20-%20Reading%20from%20a%20CSV.ipynb) of the [`pandas` cookbook](https://github.com/jvns/pandas-cookbook), which Ms. Evans has shared under a [Creative Commons Attribution-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-sa/4.0/).

To begin, we want to use the ``pandas.read_csv`` function to load a ``csv`` file into memory. The file in question, ``data/bikes.csv``, is a record of usage of bicycle paths in Montreal from 2012 (here's the [original page](http://donnees.ville.montreal.qc.ca/dataset/velos-comptage) in French).

> *Note: The data file contains the total count of bicycles that were observed on a path in a day, organized by date and by the name of the bicycle path.*

We can use the Unix ``head`` command (or a similar command in Windows) to examine the top six lines of the file `data/bikes.csv`.

        !head -6 data/bikes.csv # prints the first 6 lines of the file to standard output

        Date;Berri 1;Br�beuf (donn�es non disponibles);C�te-Sainte-Catherine;Maisonneuve 1;Maisonneuve 2;du Parc;Pierre-Dupuy;Rachel1;St-Urbain (donn�es non disponibles)
        01/01/2012;35;;0;38;51;26;10;16;
        02/01/2012;83;;1;68;153;53;6;43;
        03/01/2012;135;;2;104;248;89;3;58;
        04/01/2012;144;;1;116;318;111;8;61;
        05/01/2012;197;;2;124;330;97;13;95;

Notice that the first row appears corrupted: the file's character encoding differs from the one used to display the output, so the accented French characters are not rendered correctly. Also notice that the column separator is a semicolon (``;``).

If we try to load this file with a default call to ``pandas.read_csv``, we will likely get an error or a badly parsed table, because both the column separator (a semicolon rather than a comma) and the character encoding (``latin1`` rather than ``utf8``) differ from the function's defaults.
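The two failure modes can be reproduced without the real file. The sketch below is only an illustration: it builds a small, hypothetical three-column sample in memory (the column names are borrowed from the header above, the byte string is an assumption standing in for ``data/bikes.csv``) and reads it three ways.

```python
import io
import pandas as pd

# Hypothetical in-memory sample that mimics the layout of data/bikes.csv
sample_text = "Date;Berri 1;Côte-Sainte-Catherine\n01/01/2012;35;0\n02/01/2012;83;1\n"
sample_bytes = sample_text.encode("latin1")

# 1) All defaults: decoding the latin1 bytes as utf-8 fails on the accented character
try:
    pd.read_csv(io.BytesIO(sample_bytes))
    default_read_failed = False
except UnicodeDecodeError:
    default_read_failed = True

# 2) Correct encoding, default comma separator: each line lands in a single column
one_column = pd.read_csv(io.BytesIO(sample_bytes), encoding="latin1")

# 3) Correct encoding and separator: three properly named columns
parsed = pd.read_csv(io.BytesIO(sample_bytes), sep=";", encoding="latin1")

print(default_read_failed)
print(one_column.shape)
print(list(parsed.columns))
```

Fixing only one of the two settings is not enough: the encoding controls whether the bytes can be decoded at all, and the separator controls how each decoded line is split.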

## Set-up

In [None]:
import pandas as pd
%matplotlib inline

## Part 1: Read Format

Use the documentation for the ``pandas.read_csv`` function to load the file ``data/bikes.csv`` into a ``pandas.DataFrame`` called ``bike_data``, using the ``Date`` column as the index.

In [None]:
# Solution: 
# Note: Read the Background section above to learn about possible utf-8/latin1 errors.

file_name = 'data/bikes.csv'

bike_data = pd.read_csv(file_name, 
                        sep=';', 
                        encoding='latin1',
                        index_col='Date')

print( "Index dtype = ", bike_data.index.dtype )
print( bike_data.index )
bike_data.head(5)

## Part 2: Time Format

Observe that the ``Date`` column of ``bike_data`` contains string representations of dates in the (unwisely ambiguous) format ``DD/MM/YYYY``. 
    
* Use the documentation to help you modify your call to ``pandas.read_csv`` so that the dates are parsed correctly as ``Timestamp`` values.
    
* Once the ``pandas.DataFrame`` ``bike_data`` has been loaded into memory successfully, we should be able to extract columns with appropriate dates as indices, e.g.,

```python
# Use tab-completion to enter column names easily
bike_data['Maisonneuve 1'].iloc[:5]

Date
2012-01-01     38
2012-01-02     68
2012-01-03    104
2012-01-04    116
2012-01-05    124
Name: Maisonneuve 1, dtype: int64
```

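The `.ix[]` indexer, which mixed label and position selection, has been removed from pandas. Use `.loc[]` for label-based selection and `.iloc[]` for position-based selection instead, for example:

```python
bike_data.iloc[0:5, bike_data.columns.get_loc('Maisonneuve 1')]
```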

Notice that usage remains positive during the winter months, which is pretty impressive in Montreal!
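Once the index holds dates, rows can also be selected by date label. The following is a minimal sketch on a stand-in frame named `sample` (a hypothetical name), rebuilt from the five `Maisonneuve 1` values shown in the output above so that it runs without the data file.

```python
import pandas as pd

# Stand-in for bike_data: the first five 'Maisonneuve 1' counts shown above
dates = pd.date_range("2012-01-01", periods=5, freq="D")
sample = pd.DataFrame({"Maisonneuve 1": [38, 68, 104, 116, 124]}, index=dates)

# Label-based slice with .loc: both endpoints are included
first_three = sample.loc["2012-01-01":"2012-01-03", "Maisonneuve 1"]

# Position-based slice with .iloc: the end position is excluded
first_two = sample["Maisonneuve 1"].iloc[0:2]

# Partial date string with .loc: every row that falls in January 2012
january = sample.loc["2012-01"]

print(first_three.tolist())
print(first_two.tolist())
print(len(january))
```

The difference in endpoint handling is worth remembering: a `.loc` slice over labels includes the last label, while an `.iloc` slice over positions follows the usual Python convention and stops before the end.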

In [None]:
# Solution

file_name = 'data/bikes.csv'

bike_data = pd.read_csv(file_name, 
                        sep=';', 
                        encoding='latin1',
                        parse_dates=['Date'], 
                        dayfirst=True, 
                        index_col='Date' )

# Notice the Index now has a dtype=datetime64
print( "Index dtype = ", bike_data.index.dtype )
print( bike_data.index )
bike_data.head(5)
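To see why `dayfirst=True` matters in the solution above, it can help to parse a single ambiguous date string on its own. This sketch uses `pandas.to_datetime` directly rather than `read_csv`; the string is the date from the second data row of the file.

```python
import pandas as pd

ambiguous = "02/01/2012"  # second data row of the file: 2 January 2012

# Default parsing reads the string month-first (MM/DD/YYYY)
month_first = pd.to_datetime(ambiguous)

# dayfirst=True reads the string as DD/MM/YYYY
day_first = pd.to_datetime(ambiguous, dayfirst=True)

# An explicit format string removes the guesswork entirely
explicit = pd.to_datetime(ambiguous, format="%d/%m/%Y")

print(month_first.date())  # 2012-02-01
print(day_first.date())    # 2012-01-02
print(explicit.date())     # 2012-01-02
```

Without `dayfirst=True`, the first twelve days of each month would silently be read as the wrong dates, which is easy to miss because no error is raised.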

## Part 3: Visualization

Finally, use the ``pandas.Series`` method `.plot()` to generate a plot of the usage of the `Maisonneuve 1` bike trail during 2012. 

As an aside, the ``pandas.DataFrame`` class has a similar ``.plot()`` method that plots all the time series (corresponding to bike usage on each of the trails) on the same axes.
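The difference between the two `.plot()` methods can be sketched on a small stand-in frame. The values below are the first five rows of two columns from the file listing in the Background section; the name `sample`, the two-panel layout, and the axis label are illustrative choices, not part of the original exercise.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this also runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for bike_data: first five days of two paths, taken from the listing above
dates = pd.date_range("2012-01-01", periods=5, freq="D")
sample = pd.DataFrame(
    {"Berri 1": [35, 83, 135, 144, 197], "Maisonneuve 1": [38, 68, 104, 116, 124]},
    index=dates,
)

fig, (ax_one, ax_all) = plt.subplots(2, 1, figsize=(12, 10))

# Series.plot(): one line for the selected column
sample["Maisonneuve 1"].plot(ax=ax_one, title="Maisonneuve 1")

# DataFrame.plot(): one line per column, with a legend
sample.plot(ax=ax_all, title="All paths")

ax_one.set_ylabel("bicycles per day")
ax_all.set_ylabel("bicycles per day")
```

Passing `ax=` explicitly keeps each plot on its own axes; inside a notebook with `%matplotlib inline`, the `matplotlib.use("Agg")` line can be dropped.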

In [None]:
# Solution

bike_data.plot(figsize=(12, 10))

# Notice the text labels on the x-axis

In [None]:
# Solution Alternate 1: plot only one column

bike_data['Maisonneuve 1'].plot(figsize=(12, 10))

In [None]:
# Solution Alternate 2: 
bike_data.plot(marker='o',linestyle='None', figsize=(12, 10))

## Part 4: Optional: Simple Statistics to help visualization

Plot the usage (bicycle count per day) in a way that is visually easy to parse: compute the monthly mean of path usage and plot that rather than the daily usage.

In [None]:
# Reminder: Pandas includes methods for computing mean(), std(), etc...

bike_data.mean()

In [None]:
# Reminder: Pandas also includes a nice summary statistics tool called describe()

bike_data.describe()

In [None]:
# Solution: compute the monthly mean bicycle count-per-day

bike_data.groupby(bike_data.index.month).mean()

In [None]:
# Solution: compute the monthly mean and plot it, all in one block

(
    bike_data.groupby(bike_data.index.month)
    .mean()
    .plot(marker='o', linestyle='None', figsize=(12, 10))
)
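Grouping by `index.month` works here because the data covers a single year. As a hedged aside, the sketch below uses made-up counts spanning two years (not the real bike data) to show that the same grouping would pool January 2012 with January 2013, whereas grouping by calendar month via `to_period("M")` keeps them apart.

```python
import pandas as pd

# Hypothetical daily counts spanning two full years (not the real bike data)
dates = pd.date_range("2012-01-01", "2013-12-31", freq="D")
counts = pd.DataFrame({"Hypothetical path": range(len(dates))}, index=dates)

# Month number only: the same month from different years shares one group
by_month_number = counts.groupby(counts.index.month).mean()

# Calendar month (year and month): one group per month of each year
by_calendar_month = counts.groupby(counts.index.to_period("M")).mean()

print(len(by_month_number))    # 12
print(len(by_calendar_month))  # 24
```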