# Code4Lib 2017 Workshop Part 1

## Importing the Pandas library

Before we can use Pandas, we need to import it, much like any other programming library. In addition to Pandas, we'll also import the numpy library and the part of the matplotlib library which is called "pyplot". We will also give all these libraries aliases so that they can be referred to later by shorter names. 

In [None]:
# import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Pandas Data Structures

Pandas provides two main data structures: Series and DataFrame. We'll look at both of these in a little more detail.

### Series
Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call:

In [None]:
data = [1, 2, 3, 4, 5, 6, 7]
s1 = pd.Series(data)
s1

In [None]:
# try doing this on your own
data2 = []
s2 = pd.Series(data2)

You can also pass other types of data to create a Series, including other Python data structures, and NumPy data structures. Let's try creating a Series with a Python dictionary:

In [None]:
dict_data = {'name': 'Bret', 'occupation': 'Digital Technologies Development Librarian', 'points': 481}
s3 = pd.Series(dict_data)
s3
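
To illustrate how the dictionary keys end up as the index, here is a minimal sketch. The names and numbers are hypothetical sample data, not part of the workshop dataset.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
scores = pd.Series({'alice': 90, 'bob': 75, 'carol': 82})

# The dictionary keys become the index labels
labels = list(scores.index)

# Values can be looked up by label, much like a dictionary
bob_score = scores['bob']

print(labels)
print(bob_score)
```

Looking a value up by its label works the same way for the s3 Series above, for example s3['name'].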

### DataFrame

A DataFrame is a 2-dimensional data structure that holds data of any type in columns. It is similar to a spreadsheet or data.frame in R. DataFrames are 0-indexed.

You can create a DataFrame by passing in a dictionary whose keys are column labels, and values are lists (or Series, or ndarrays) containing column values. Here is an example of creating a DataFrame containing information about fruits:


In [None]:
fruits_df = pd.DataFrame({'name': ['apple', 'banana', 'orange'],
                          'color': ['red', 'yellow', 'orange'],
                          'size_cm': [5, 12, 4]})
fruits_df

You can add new rows to an existing DataFrame by creating a new DataFrame with the additional rows, and then concatenating the two with "pd.concat()". (Older versions of Pandas offered a DataFrame "append()" method for this, but it was deprecated in Pandas 1.4 and removed in 2.0.) Setting the "ignore_index" parameter to True renumbers the rows of the result from 0 upward instead of keeping the index labels each DataFrame started with. Try running the next block of code, and then try setting "ignore_index" to False and re-running to see what happens. It is important to note that "pd.concat()" returns a new DataFrame rather than modifying fruits_df in place.

In [None]:
# adding new rows
more_fruits_df = pd.DataFrame({'name': ['strawberry', 'blueberry'],
                               'color': ['red', 'blue'],
                               'size_cm': [3, 1]})
# pd.concat returns a new DataFrame; fruits_df itself is unchanged
pd.concat([fruits_df, more_fruits_df], ignore_index=True)
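
As a sketch of the "ignore_index" experiment suggested above, the example below combines two small DataFrames with pd.concat(), which is available in all recent Pandas versions. The DataFrames are trimmed-down stand-ins for fruits_df and more_fruits_df, and the variable names are made up for this illustration.

```python
import pandas as pd

# Trimmed-down stand-ins for fruits_df and more_fruits_df
fruits = pd.DataFrame({'name': ['apple', 'banana', 'orange'],
                       'size_cm': [5, 12, 4]})
more_fruits = pd.DataFrame({'name': ['strawberry', 'blueberry'],
                            'size_cm': [3, 1]})

# ignore_index=True renumbers the combined rows from 0
renumbered = pd.concat([fruits, more_fruits], ignore_index=True)

# ignore_index=False keeps each DataFrame's original labels,
# so the labels 0 and 1 appear twice
kept = pd.concat([fruits, more_fruits], ignore_index=False)

print(list(renumbered.index))  # [0, 1, 2, 3, 4]
print(list(kept.index))        # [0, 1, 2, 0, 1]
print(len(fruits))             # 3, the original is not modified
```

Duplicate index labels are allowed, but they make label-based lookups ambiguous, which is why renumbering is usually the safer choice when stacking rows.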

Notice that in the previous examples, the index (left-hand column) defaults to an increasing integer that starts at 0. You can also pass an array as an "index" argument to the DataFrame constructor to define a different index.

In [None]:
pd.DataFrame({'name': ['carrot', 'broccoli', 'lettuce'],
              'color': ['orange', 'green', 'green']},
             index=['first', 'second', 'third'])

To learn more about the different ways you can create DataFrames, read here: http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe
    
#### Column selection, addition, and deletion

A good way to think about a DataFrame is as a collection of Series objects that share a single index, where each column of the DataFrame is one Series.

You can access data within a DataFrame by column, similarly to the way you would access data from a dictionary in Python. We'll select the size_cm column of the fruits_df DataFrame from earlier:

In [None]:
fruits_df['size_cm']
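
Selection is shown above. Addition and deletion could look roughly like the sketch below. It uses a reduced copy of the fruits data, and the size_mm column is a hypothetical name introduced only for this example.

```python
import pandas as pd

# Reduced copy of the fruits data from earlier
fruits = pd.DataFrame({'name': ['apple', 'banana', 'orange'],
                       'size_cm': [5, 12, 4]})

# Selection: a single column comes back as a Series
sizes = fruits['size_cm']

# Addition: assign to a new column label (size_mm is a made-up column)
fruits['size_mm'] = fruits['size_cm'] * 10

# Deletion: drop() returns a new DataFrame without the column
without_mm = fruits.drop(columns=['size_mm'])

print(list(fruits.columns))      # ['name', 'size_cm', 'size_mm']
print(list(without_mm.columns))  # ['name', 'size_cm']
```

Note the asymmetry: assigning a column changes the DataFrame in place, while drop() leaves it untouched and returns a copy.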

## Loading Data

Pandas can load data in a variety of formats, including CSV, Excel, HDF, SQL, JSON, etc. We'll specifically look at how to load CSV data, but the methods for loading other types are largely the same, and can be found here: http://pandas.pydata.org/pandas-docs/stable/io.html

We will load local data that has already been saved to our server, but you can also load data over the web by passing a URL instead of a path to a file.

For this workshop, we'll work with a dataset called "surveys.csv", which contains a subset of the data from Ernst et al., "Long-term monitoring and experimental manipulation of a Chihuahuan Desert ecosystem near Portal, Arizona, USA".

The data is stored locally at the path "data/surveys.csv", but otherwise, the data can be downloaded here: https://ndownloader.figshare.com/files/2292172

To load the data into a DataFrame, call the "pd.read_csv" function, and pass in the path to the CSV file as a string:


In [None]:
survey_dataframe = pd.read_csv("data/surveys.csv")

survey_dataframe
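
Besides a file path or a URL, pd.read_csv also accepts a file-like object, which is handy for trying things out without the data file. The sketch below uses that. The column names match ones that appear in the survey data, but the rows are made up for illustration and are not taken from surveys.csv.

```python
import io
import pandas as pd

# Made-up rows for illustration only; these are not from surveys.csv
csv_text = """record_id,species_id,weight
1,AA,10.5
2,BB,
3,AA,12.0
"""

# StringIO wraps the text so read_csv can treat it like an open file
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)                     # (3, 3)
print(sample_df['weight'].isnull().sum())  # 1, the empty field became NaN
```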

## Exploring Data

The data we just loaded has a lot of rows. Printing out the entire DataFrame gave us a compressed view of the data, but using some built-in methods, we can see a lot more information.

First, all the column names in a DataFrame are stored in the "columns" attribute, and can be returned like this: 

In [None]:
survey_dataframe.columns

Sometimes it is useful to see the first few rows or last few rows of a dataset. You can use the "head()" and "tail()" methods to do that. By default they display the first or last 5 rows, but you can pass a number if you'd like a different number of rows.

In [None]:
survey_dataframe.head()

In [None]:
survey_dataframe.tail(10)

To get a quick idea about the size of the data you are working with, use the "shape" attribute on the DataFrame. This returns a tuple with the size of each dimension of the DataFrame:

In [None]:
survey_dataframe.shape
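
If you want to check how these methods behave on something small enough to verify by eye, a sketch like the following may help. The 8-row DataFrame is made up for this example.

```python
import pandas as pd

# A made-up 8-row DataFrame, for illustration only
df = pd.DataFrame({'record_id': range(1, 9),
                   'weight': [40, 48, 29, 46, 36, 52, 44, 45]})

first_five = df.head()    # no argument: the first 5 rows
last_three = df.tail(3)   # the last 3 rows

# shape is an attribute (no parentheses): (rows, columns)
n_rows, n_cols = df.shape

print(len(first_five))                  # 5
print(list(last_three['record_id']))    # [6, 7, 8]
print(n_rows, n_cols)                   # 8 2
```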

So far we've been looking at information that we can easily figure out by looking at the output of the entire DataFrame. We can get a better summary of the entire dataset by using the "describe()" method:

In [None]:
survey_dataframe.describe()

As you can see, we get a variety of metrics for every numeric column in the DataFrame. These include:

* Number of non-null entries (count)
* Mean
* Standard deviation (std)
* Minimum value (min)
* Lower quartile (25%)
* Median (50%)
* Upper quartile (75%)
* Maximum value (max)

However, these values are not as useful for enumerations such as record_id/plot_id, or ordinal values such as month/day/year. We can use the "describe()" method to get information about specific columns in a DataFrame like this:

In [None]:
survey_dataframe['weight'].describe()

You can get these statistics individually as well by calling each method separately. Find more information here about all the stats methods you can call on a DataFrame: http://pandas.pydata.org/pandas-docs/stable/api.html#api-dataframe-stats
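
As a small sketch of how "describe()" relates to the individual methods, the example below uses a made-up Series with one missing value. Note that count only includes the non-null entries.

```python
import pandas as pd

# Made-up values with one missing entry, for illustration only
weights = pd.Series([10.0, 20.0, 30.0, 40.0, None])

summary = weights.describe()

print(summary['count'])  # 4.0, the missing value is not counted
print(summary['50%'])    # 25.0
print(weights.mean())    # 25.0
print(weights.median())  # 25.0
print(weights.max())     # 40.0
```

The result of "describe()" is itself a Series, so individual statistics can be pulled out by label, as with summary['count'].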

You'll notice there are other methods to get statistics that aren't represented in the "describe()" call above. A popular example is getting the mode of the column. Let's find out what the most popular species is in the data (remember the first value in the output below is the index, and not the count):

In [None]:
survey_dataframe['species_id'].mode()

In addition to seeing the most popular species, it might be useful to see a listing of all possible species. To do this, we can use the "pd.unique()" function, which returns all unique values in a Series as a NumPy array. We'll pass the 'species_id' column of the DataFrame:

In [None]:
pd.unique(survey_dataframe['species_id'])

You can also find the frequency of values in a column. Here we find how often each species is found in the data.

In [None]:
survey_dataframe['species_id'].value_counts()
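
To see how these three calls relate to each other, here is a sketch on a tiny made-up Series. The two-letter codes are hypothetical placeholders, not real entries from the survey data.

```python
import pandas as pd

# Hypothetical species codes, for illustration only
species = pd.Series(['AA', 'BB', 'AA', 'CC', 'AA', 'BB'])

# mode() returns a Series (there can be ties), so take the first entry
most_common = species.mode()[0]

# unique values, in order of first appearance
unique_species = list(pd.unique(species))

# value_counts() sorts from most to least frequent
counts = species.value_counts()

print(most_common)      # AA
print(unique_species)   # ['AA', 'BB', 'CC']
print(counts['AA'])     # 3
```

The top entry of "value_counts()" matches the mode, so it gives both the most common value and how often it occurs.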

## Pandas Data types (dtypes)

By now you've probably seen the word "dtype" in the output of many of the previous commands. Pandas offers multiple data types (dtypes) to work with, differing slightly from the datatypes offered by Python. Here is a table summarizing the different datatypes:


| Pandas Type                     | Python Native Type | Description                                                                                                                                                        |
|---------------------------------|--------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| object                          | str                | The most general dtype. It is assigned to your column if the column holds strings or mixed types (numbers and strings)                                             |
| int64                           | int                | Integer values. 64 refers to the number of bits used to store each value                                                                                           |
| float64                         | float              | Numbers with a decimal part. If a column contains numbers and NaN (not a number), Pandas will default to float64, because NaN is itself a floating-point value.    |
| datetime64[ns], timedelta64[ns] | N/A                | Values meant to hold time data. Look into these for time-series experiments.                                                                                       |

### Inspecting Data Types

You can find out the datatype of a particular column by looking at the "dtype" attribute. We'll look at a few in the survey_dataframe DataFrame from the previous section:

In [None]:
survey_dataframe['record_id'].dtype

In [None]:
survey_dataframe['weight'].dtype

In [None]:
survey_dataframe['species_id'].dtype
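
The "dtypes" attribute (plural) lists the dtype of every column at once. The sketch below uses a small made-up DataFrame to show how a missing value pushes a column of whole numbers to float64. The values are for illustration only.

```python
import pandas as pd
import numpy as np

# Made-up rows, for illustration only
df = pd.DataFrame({'record_id': [1, 2, 3],
                   'weight': [40, np.nan, 48],
                   'species_id': ['AA', 'BB', 'AA']})

# One dtype per column
print(df.dtypes)

# weight holds whole numbers plus one NaN, so it is stored as float64
weight_dtype = df['weight'].dtype
print(weight_dtype)  # float64
```

The string column is typically reported as object, though this may differ between Pandas versions.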

### NaN

Please note that Pandas represents null (missing) values as NaN ("not a number").
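
A minimal sketch of how NaN values can be detected, using a made-up Series. It also shows that statistics such as "mean()" skip NaN by default.

```python
import pandas as pd
import numpy as np

# Made-up values with two missing entries, for illustration only
weights = pd.Series([40.0, np.nan, 48.0, np.nan])

# isnull() gives True wherever a value is missing
missing_mask = weights.isnull()

# Summing the mask counts the missing values (True counts as 1)
n_missing = missing_mask.sum()

# mean() ignores NaN by default: (40 + 48) / 2
mean_weight = weights.mean()

print(list(missing_mask))  # [False, True, False, True]
print(n_missing)           # 2
print(mean_weight)         # 44.0
```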