# Code4Lib 2017 Workshop

## Importing the Pandas library

Before we can use Pandas, we need to import it, much like any other programming library. In addition to Pandas, we'll also import the NumPy library and the part of the matplotlib library called "pyplot". We'll give each of these libraries an alias so that it can be referred to later by a shorter name.

In [2]:
# import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Loading Data

Pandas can load data in a variety of formats, including CSV, Excel, HDF, SQL, JSON, etc. We'll specifically look at how to load CSV data, but the methods for loading other types are largely the same, and can be found here: http://pandas.pydata.org/pandas-docs/stable/io.html

We will load local data that has already been saved to our server, but you can also load data over the web by passing a URL instead of a file path.

For this workshop, we'll work with a dataset called "surveys.csv", which contains a subset of the data from Ernst et al., "Long-term monitoring and experimental manipulation of a Chihuahuan Desert ecosystem near Portal, Arizona, USA".

The data is stored locally at the path "data/surveys.csv", but otherwise, the data can be downloaded here: https://ndownloader.figshare.com/files/2292172

To load the data, call the "pd.read_csv" function and pass in the path to the CSV file as a string:


In [6]:
survey_data = pd.read_csv("data/surveys.csv")

survey_data

Unnamed: 0,record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight
0,1,7,16,1977,2,NL,M,32.0,
1,2,7,16,1977,3,NL,M,33.0,
2,3,7,16,1977,2,DM,F,37.0,
3,4,7,16,1977,7,DM,M,36.0,
4,5,7,16,1977,3,DM,M,35.0,
5,6,7,16,1977,1,PF,M,14.0,
6,7,7,16,1977,2,PE,F,,
7,8,7,16,1977,1,DM,M,37.0,
8,9,7,16,1977,1,DM,F,34.0,
9,10,7,16,1977,6,PF,F,20.0,
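As mentioned earlier, "pd.read_csv" is not limited to local paths: it also accepts a URL or any file-like object. The sketch below is a minimal illustration that reads the first three survey rows from an in-memory buffer, so it runs without the data file. The names `csv_text` and `sample` are made up for this example.

```python
import io

import pandas as pd

# The first three survey rows from the table above, held in memory as CSV text.
csv_text = """record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight
1,7,16,1977,2,NL,M,32.0,
2,7,16,1977,3,NL,M,33.0,
3,7,16,1977,2,DM,F,37.0,
"""

# read_csv accepts a file path, a URL, or a file-like object such as StringIO.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)                   # (number of rows, number of columns)
print(sample["weight"].isna().all())  # empty fields are read as NaN
```

Swapping the buffer for the path "data/surveys.csv" or for the download URL should load the full dataset in the same way.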


## Pandas Data Structures

Pandas provides two main data structures: Series and DataFrame. We'll look at both of these in a little more detail.

### Series
A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic way to create a Series is to call "pd.Series" on a list of values:

In [7]:
data = [1, 2, 3, 4, 5, 6, 7]
s1 = pd.Series(data)
s1

0    1
1    2
2    3
3    4
4    5
5    6
6    7
dtype: int64
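The output above shows the default index, which simply numbers the values from 0. As a small sketch of what "labeled" means, a Series can also be given explicit labels through the `index` argument. The labels and values here are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical values and labels, used only to illustrate a custom index.
labeled = pd.Series([32.0, 33.0, 37.0], index=["a", "b", "c"])

print(labeled.index)  # the axis labels
print(labeled["b"])   # look up a value by its label
```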

In [8]:
# try doing this on your own: fill data2 with your own values
data2 = []
s2 = pd.Series(data2)

You can also pass other types of data to create a Series, including other Python data structures, and NumPy data structures. Let's try creating a Series with a Python dictionary:

In [11]:
dict_data = {'name': 'Bret', 'occupation': 'Digital Technologies Development Librarian', 'points': 481}
s3 = pd.Series(dict_data)
s3

name                                                Bret
occupation    Digital Technologies Development Librarian
points                                               481
dtype: object

### Series Details
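* A Series is like a NumPy ndarray (an n-dimensional array whose values share one type): it supports slicing and element-wise operations.
* A Series is like a dict: values can be looked up by their index label, much like dictionary keys.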
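Both behaviours can be sketched with a small Series. The values and the name `points` are invented for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical values, used only for illustration.
points = pd.Series({"a": 1, "b": 4, "c": 9})

# ndarray-like behaviour
first_two = points.iloc[:2]  # positional slice; the labels are kept
doubled = points * 2         # element-wise arithmetic
roots = np.sqrt(points)      # NumPy functions accept a Series

# dict-like behaviour
b_value = points["b"]          # access a value by key (index label)
has_z = "z" in points          # membership test checks the labels
fallback = points.get("z", 0)  # default for a missing label, like dict.get

print(first_two, doubled, roots, sep="\n")
print(b_value, has_z, fallback)
```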

### DataFrame
* How to create a data frame?
    * From series or dicts
    * From dict of ndarrays or lists
    * From structured record or array
    * From list of dicts
    * From a dict of tuples
    
* Column selection, addition, and deletion
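A rough sketch of a few of the items above: three of the listed ways to build a DataFrame, followed by column selection, addition, and deletion. The column names and values are borrowed from the survey data for illustration, and the variable names (and the `length_doubled` column) are made up for this example.

```python
import pandas as pd

# From a dict of lists: each key becomes a column.
df_lists = pd.DataFrame({
    "species_id": ["NL", "DM", "PF"],
    "hindfoot_length": [32.0, 37.0, 14.0],
})

# From a list of dicts: each dict becomes a row.
df_rows = pd.DataFrame([
    {"species_id": "NL", "hindfoot_length": 32.0},
    {"species_id": "DM", "hindfoot_length": 37.0},
])

# From a dict of Series: rows are aligned on the index labels,
# and labels missing from one Series are filled with NaN.
df_series = pd.DataFrame({
    "x": pd.Series([1, 2], index=["a", "b"]),
    "y": pd.Series([3, 4], index=["b", "c"]),
})

# Column selection returns a Series.
lengths = df_lists["hindfoot_length"]

# Column addition: assign to a new column name.
df_lists["length_doubled"] = df_lists["hindfoot_length"] * 2
columns_after_add = list(df_lists.columns)

# Column deletion.
del df_lists["length_doubled"]

print(df_series)
print(columns_after_add)
print(list(df_lists.columns))
```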



### Pandas Data types (dtypes)

Pandas offers multiple data types (dtypes) to work with, differing slightly from the datatypes offered by Python. Here is a table summarizing the different datatypes:


| Pandas Type             | Python Native Type | Description |
|-------------------------|--------------------|-------------|
| object                  | str                | The most general dtype. It is assigned to a column that holds strings or mixed types (numbers and strings). |
| int64                   | int                | Integers. The 64 refers to the number of bits used to store each value. |
| float64                 | float              | Numbers with decimals. A column containing numbers and NaN (not a number) defaults to float64, because NaN is itself a floating-point value. |
| datetime64, timedelta64 | N/A                | Values that hold dates, times, and durations. Look into these for time-series work. |
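The rows of the table can be checked with a few tiny Series. This is a minimal sketch with invented values; it prints the dtype that Pandas infers in each case.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])             # whole numbers only
with_missing = pd.Series([1, None, 3])  # the missing value becomes NaN
mixed = pd.Series([1, "two", 3.0])      # numbers and strings together
dates = pd.to_datetime(pd.Series(["1977-07-16", "1977-07-17"]))

for name, series in [("ints", ints), ("with_missing", with_missing),
                     ("mixed", mixed), ("dates", dates)]:
    print(name, series.dtype)
```

The integer Series comes back as int64, the one with a missing value as float64, the mixed one as object, and the parsed dates as a datetime64 dtype.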

### Inspecting Data Types

You can find out the datatype of a particular column by accessing its "dtype" attribute:

In [10]:
survey_data['record_id'].dtype
survey_data['hindfoot_length'].dtype
survey_data['species_id'].dtype


dtype('O')
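A notebook cell displays only the value of its last expression, which is why a single result appears above. To see every column's dtype at once, the DataFrame's `dtypes` attribute can be used. The sketch below assumes a small stand-in DataFrame named `sample`, built from three of the survey rows so that it runs without the data file.

```python
import pandas as pd

# Stand-in for survey_data, built from three rows of the survey table.
sample = pd.DataFrame({
    "record_id": [1, 2, 3],
    "species_id": ["NL", "NL", "DM"],
    "hindfoot_length": [32.0, 33.0, 37.0],
})

# dtype of a single column
print(sample["record_id"].dtype)

# dtypes of all columns, returned as a Series indexed by column name
print(sample.dtypes)
```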

### Generating data

### Exploring data
* head
* tail
* index
* columns
* describe
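The items listed above can be tried out as in the following sketch. It uses a small stand-in DataFrame named `sample`, built from the first five survey rows, so the same calls should work on `survey_data` as well.

```python
import pandas as pd

# Stand-in for survey_data, built from the first five survey rows.
sample = pd.DataFrame({
    "record_id": [1, 2, 3, 4, 5],
    "species_id": ["NL", "NL", "DM", "DM", "DM"],
    "hindfoot_length": [32.0, 33.0, 37.0, 36.0, 35.0],
})

first = sample.head(2)       # first 2 rows (head() alone returns 5)
last = sample.tail(2)        # last 2 rows
print(sample.index)          # row labels
print(sample.columns)        # column labels
summary = sample.describe()  # summary statistics for the numeric columns

print(first)
print(last)
print(summary)
```

By default `describe` summarizes only the numeric columns here, so "species_id" does not appear in its output.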
