## Use the Pandas library to do statistics on tabular data.

*   Pandas is a widely used Python library for statistics, particularly on tabular data.
*   Borrows many features from R's dataframes.
    *   A 2-dimensional table whose columns have names
        and potentially have different data types.

Goals:

*   Load it with `import pandas as pd`.
*   Read a Comma Separated Values (CSV) data file with `pandas.read_csv`.
    *   Argument is the name of the file to be read.
    *   Assign result to a variable to store the data that was read.

In [24]:
import pandas as pd

In [27]:
data_fn = '../data/tasmania-births-1790-1799.csv.bz2'

> ## File Not Found
>
> Our data directory is stored in a subdirectory called `data` off the parent directory.
> Similarly, notebooks are in `notebooks`. To get at the data from the notebook we use:
> `../data/tasmania-births-1790-1799.csv.bz2`.
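
If `read_csv` raises `FileNotFoundError`, a quick check along these lines can show where the notebook is actually running from. This is only a sketch; it assumes the notebook was started from the `notebooks` directory described above.

```python
from pathlib import Path

# Relative path used in this notebook. It is resolved against the current
# working directory, which is assumed here to be the notebooks directory.
data_fn = Path('../data/tasmania-births-1790-1799.csv.bz2')

print('Working directory:', Path.cwd())
print('Resolved path:    ', data_fn.resolve())
print('File exists:      ', data_fn.exists())
```

If the last line prints `False`, compare the resolved path with where the `data` directory really is.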

In [28]:
data = pd.read_csv(data_fn)
data

Unnamed: 0,NI_BIRTH_DATE,NI_BIRTH_DATE.1,NI_GENDER,NI_NAME_FACET,NI_MOTHER,NI_FATHER,NI_REG_YEAR,NI_REG_PLACE
0,1799-12-01,Dec 1799,Male,"Smith, Thomas","Smith, Mary","Smith, Name Not Recorded",1816,Hobart
1,1797-01-01,1797,Female,"Burn, Elizabeth","Burn, Mary","Burn, Name Not Recorded",1816,Hobart
2,1799-01-01,1799,Male,"Surname Not Recorded, Robert New-norfolk","Name, Not Recorded","Surname Not Recorded, Name Not Recorded",1819,Hobart
3,1797-01-01,1797,Female,"Davey, Lucy Margaret","Davey, Margaret","Davey, Thomas",1821,Hobart
4,1798-01-01,1798,Male,"Day, Thomas","Name, Not Recorded","Day, Name Not Recorded",1831,Hobart
5,1799-12-25,25 Dec 1799,Female,"Styles, Elizabeth","Styles, Elizabeth","Styles, Name Not Recorded",1835,Hobart


*   The columns in a dataframe are the observed variables, and the rows are the observations.
*   Pandas uses backslash `\` to show wrapped lines when output is too wide to fit the screen.
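
To make the variables-and-observations point concrete, the sketch below builds a small dataframe from three of the rows shown above and inspects its shape and column types. The name `sample_csv` and the subset of columns are illustrative choices, not part of the lesson's data file.

```python
import io

import pandas as pd

# Three rows copied from the table above; the real file has more rows and columns.
sample_csv = '''NI_BIRTH_DATE,NI_GENDER,NI_NAME_FACET,NI_REG_YEAR,NI_REG_PLACE
1799-12-01,Male,"Smith, Thomas",1816,Hobart
1797-01-01,Female,"Burn, Elizabeth",1816,Hobart
1798-01-01,Male,"Day, Thomas",1831,Hobart
'''

# read_csv accepts a file-like object as well as a file name.
sample = pd.read_csv(io.StringIO(sample_csv))

print(sample.shape)   # (number of observations, number of variables)
print(sample.dtypes)  # one data type per column
```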

## Use `index_col` to specify that a column's values should be used as row headings.

*   By default, row headings are numbers (0, 1, 2, and so on in this case).
*   Really want to index by date of birth (`NI_BIRTH_DATE`).
*   Pass the name of the column to `read_csv` as its `index_col` parameter to do this.

In [29]:
index_col = 'NI_BIRTH_DATE'

In [30]:
data = pd.read_csv(data_fn, index_col=index_col)
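
As a self-contained illustration of what `index_col` changes, the following sketch reads a few rows from an in-memory string instead of the real file. The `sample_csv` text is a hypothetical excerpt built from rows shown earlier.

```python
import io

import pandas as pd

# Hypothetical excerpt with unique birth dates, copied from the table above.
sample_csv = '''NI_BIRTH_DATE,NI_GENDER,NI_REG_YEAR
1799-12-01,Male,1816
1797-01-01,Female,1816
1798-01-01,Male,1831
'''

sample = pd.read_csv(io.StringIO(sample_csv), index_col='NI_BIRTH_DATE')

# The birth dates are now the row headings rather than an ordinary column.
print(sample.index)

# Rows can be looked up by heading with .loc.
print(sample.loc['1797-01-01'])
```

Without further arguments the dates are kept as text, so the lookup uses the string `'1797-01-01'`.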

## Call `head` and `tail` on the dataframe to look at the start and end of the table

In Jupyter, use `data.head?` to see the optional arguments.

In [33]:
# Show the first 5 rows of the data

In [34]:
# Show the last 3 rows of the data
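
One possible way to fill in the two cells above, sketched on a six-row stand-in for the real data (the `sample` dataframe is an assumption made so the example runs on its own):

```python
import io

import pandas as pd

# Six rows taken from the table shown earlier, reduced to three columns.
sample_csv = '''NI_BIRTH_DATE,NI_GENDER,NI_REG_YEAR
1799-12-01,Male,1816
1797-01-01,Female,1816
1799-01-01,Male,1819
1797-01-01,Female,1821
1798-01-01,Male,1831
1799-12-25,Female,1835
'''
sample = pd.read_csv(io.StringIO(sample_csv), index_col='NI_BIRTH_DATE')

first_five = sample.head()    # n defaults to 5
last_three = sample.tail(3)   # pass n to change the number of rows

print(first_five)
print(last_three)
```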

## Use `DataFrame.info` to find out more about a dataframe.

In [21]:
# Get info

*   This is a `DataFrame`.
*   The rows are indexed by the values of `NI_BIRTH_DATE`, the `index_col` chosen above.
*   Each column is listed with its name, its count of non-null values, and its data type.
    *   We will talk later about null values, which are used to represent missing observations.
*   The last line reports how much memory the dataframe uses.
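
A minimal sketch of calling `info`, using a small stand-in dataframe so that it runs without the data file (the rows are copied from the table shown earlier; the name `sample` is illustrative):

```python
import io

import pandas as pd

sample_csv = '''NI_BIRTH_DATE,NI_GENDER,NI_REG_YEAR
1799-12-01,Male,1816
1797-01-01,Female,1816
1798-01-01,Male,1831
'''
sample = pd.read_csv(io.StringIO(sample_csv), index_col='NI_BIRTH_DATE')

# info() prints its report directly and returns None,
# so there is no need to wrap it in print().
sample.info()
```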


## The `DataFrame.columns` property stores information about the dataframe's columns.

*   Note that this is a data property, *not* a method.
    *   Like `math.pi`.
    *   So do not use `()` to try to call it.
*   Called a *member variable*, or just *member*.

In [22]:
# Show the data frame columns
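
The difference between reading `columns` and trying to call it can be sketched as follows, again on a small illustrative dataframe rather than the real file:

```python
import io

import pandas as pd

sample_csv = '''NI_BIRTH_DATE,NI_GENDER,NI_REG_YEAR
1799-12-01,Male,1816
1797-01-01,Female,1816
'''
sample = pd.read_csv(io.StringIO(sample_csv), index_col='NI_BIRTH_DATE')

# No parentheses: columns is data attached to the dataframe.
print(sample.columns)
column_names = list(sample.columns)

# Adding parentheses tries to call the Index object, which is not callable.
try:
    sample.columns()
except TypeError as err:
    print('Calling it fails:', err)
```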

## Use `DataFrame.T` to transpose a dataframe.

*   Sometimes want to treat columns as rows and vice versa.
*   Transpose (written `.T`) changes the program's view of the data; when the columns have different data types, as they do here, Pandas has to make a copy to do so.
*   Like `columns`, it is a member variable.
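
As a rough sketch of what transposing does, the example below swaps rows and columns of a small stand-in dataframe (the sample rows are copied from the table shown earlier):

```python
import io

import pandas as pd

sample_csv = '''NI_BIRTH_DATE,NI_GENDER,NI_REG_YEAR
1799-12-01,Male,1816
1797-01-01,Female,1816
1798-01-01,Male,1831
'''
sample = pd.read_csv(io.StringIO(sample_csv), index_col='NI_BIRTH_DATE')

# .T is an attribute, so no parentheses are needed.
transposed = sample.T

print(sample.shape)      # rows x columns before transposing
print(transposed.shape)  # the two numbers swap places
print(transposed)
```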


## Use `DataFrame.describe` to get summary statistics about data.

`DataFrame.describe()` gets the summary statistics of only the columns that have numerical data.
All other columns are ignored, unless you use the argument `include='all'`.


In [45]:
# Call describe on the data here
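
One way the cell above might be filled in, shown on a six-row stand-in so the effect of `include='all'` is visible without the data file (the `sample` dataframe is an assumption for illustration):

```python
import io

import pandas as pd

sample_csv = '''NI_BIRTH_DATE,NI_GENDER,NI_REG_YEAR
1799-12-01,Male,1816
1797-01-01,Female,1816
1799-01-01,Male,1819
1797-01-01,Female,1821
1798-01-01,Male,1831
1799-12-25,Female,1835
'''
sample = pd.read_csv(io.StringIO(sample_csv), index_col='NI_BIRTH_DATE')

# Default: only the numeric column NI_REG_YEAR is summarised.
numeric_summary = sample.describe()
print(numeric_summary)

# include='all' also summarises the text column NI_GENDER.
full_summary = sample.describe(include='all')
print(full_summary)
```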

## Writing Data
 
As well as the `read_csv` function for reading data from a file,
Pandas dataframes have a `to_csv` method to write them to files.
Applying what you've learned about reading from files,
write one of your dataframes to a file called `processed.csv`.
You can use `help` to get information on how to use `to_csv`.

In [23]:
# Write out a dataframe to a csv file
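
A possible solution sketch is shown below. To avoid leaving files behind it writes `processed.csv` into a temporary directory, which is a choice made for this example only; in the notebook you would normally pass a path of your own.

```python
import io
import tempfile
from pathlib import Path

import pandas as pd

sample_csv = '''NI_BIRTH_DATE,NI_GENDER,NI_REG_YEAR
1799-12-01,Male,1816
1797-01-01,Female,1816
1798-01-01,Male,1831
'''
sample = pd.read_csv(io.StringIO(sample_csv), index_col='NI_BIRTH_DATE')

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / 'processed.csv'

    # to_csv writes the row headings as the first column by default.
    sample.to_csv(out_path)

    # Read the file back, naming the index column again.
    round_trip = pd.read_csv(out_path, index_col='NI_BIRTH_DATE')

print(round_trip)
```

If a dataframe with the default numeric row headings is written this way and read back without `index_col`, the headings come back as a column named `Unnamed: 0`; passing `index=False` to `to_csv` leaves them out.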

## Where to?

All about [DataFrames](DataFrames.ipynb).

### Questions:
- "How can I read tabular data?"

### Objectives:
- "Import the Pandas library."
- "Use Pandas to load a simple CSV data set."
- "Get some basic information about a Pandas DataFrame."

### Keypoints:
- "Use the Pandas library to get basic statistics out of tabular data."
- "Use `index_col` to specify that a column's values should be used as row headings."
- "Use `DataFrame.info` to find out more about a dataframe."
- "The `DataFrame.columns` variable stores information about the dataframe's columns."
- "Use `DataFrame.T` to transpose a dataframe."
- "Use `DataFrame.describe` to get summary statistics about data."

## References

### Software Carpentry
* [Reading Tabular Data](http://swcarpentry.github.io/python-novice-gapminder/07-reading-tabular/)