In [2]:
# import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Code4Lib 2017 Workshop Part 2

## Merging Data

In many real-world situations, the data we'd like to analyze is spread across multiple files. Pandas provides several ways of combining data, including "merge()" and "concat()". To get started, we first need to load each of our datasets into a DataFrame. We'll use the same surveys data as in the last part of the workshop, a second surveys file called "surveys2.csv", and, later on, another dataset called "species.csv", which augments the data available for each species code.

In [4]:
survey_dataframe = pd.read_csv("data/surveys.csv")
survey2_dataframe = pd.read_csv("data/surveys2.csv")

### Concatenation

First, we'll look at concatenating two DataFrames. We've imported the original survey data from the last part of the workshop, as well as another file, "surveys2.csv", which contains additional records with the same structure as "surveys.csv". Data is often split across multiple files, either for logical reasons (for example, one file per time period) or because of size constraints, so it is useful to know how to concatenate multiple sets of data.

Pandas provides the "pd.concat()" function, which accepts a sequence (such as a list or tuple) or a mapping (such as a dict) of Series or DataFrame objects to concatenate. We'll pass a list containing survey_dataframe and survey2_dataframe to create a new DataFrame, "big_survey_dataframe":

In [6]:
big_survey_dataframe = pd.concat([survey_dataframe, survey2_dataframe])
big_survey_dataframe

Unnamed: 0,record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight
0,1,7,16,1977,2,NL,M,32.0,
1,2,7,16,1977,3,NL,M,33.0,
2,3,7,16,1977,2,DM,F,37.0,
3,4,7,16,1977,7,DM,M,36.0,
4,5,7,16,1977,3,DM,M,35.0,
5,6,7,16,1977,1,PF,M,14.0,
6,7,7,16,1977,2,PE,F,,
7,8,7,16,1977,1,DM,M,37.0,
8,9,7,16,1977,1,DM,F,34.0,
9,10,7,16,1977,6,PF,F,20.0,
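
One thing to watch for: "pd.concat()" keeps each input's original row labels, so the combined DataFrame can contain repeated index values. The sketch below uses two tiny made-up frames as stand-ins for the survey files (the values are invented for illustration) to show the difference that ignore_index=True makes:

```python
import pandas as pd

# Tiny made-up stand-ins for surveys.csv and surveys2.csv (illustrative values only)
first = pd.DataFrame({"record_id": [1, 2], "species_id": ["NL", "DM"]})
second = pd.DataFrame({"record_id": [3, 4], "species_id": ["PF", "PE"]})

# Default behaviour: each frame keeps its own 0..n-1 labels, so labels repeat
stacked = pd.concat([first, second])
print(stacked.index.tolist())

# ignore_index=True discards the old labels and builds a fresh 0..n-1 index
renumbered = pd.concat([first, second], ignore_index=True)
print(renumbered.index.tolist())
```

If you later select rows by label with .loc, a fresh index avoids accidentally matching more than one row.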


### Merge

Merging in Pandas combines two different DataFrames that share an identifier into a single DataFrame, much like a join in SQL. When two DataFrames contain a shared identifier (think foreign key), you can use the "pd.merge()" function to match their rows on that identifier.

We'll first load another dataset from "species.csv", containing information which expands on the species_id in our survey dataset.

In [10]:
species_dataframe = pd.read_csv("data/species.csv")
species_dataframe.head()

Unnamed: 0,species_id,genus,species,taxa
0,AB,Amphispiza,bilineata,Bird
1,AH,Ammospermophilus,harrisi,Rodent
2,AS,Ammodramus,savannarum,Bird
3,BA,Baiomys,taylori,Rodent
4,CB,Campylorhynchus,brunneicapillus,Bird


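For more detail, see the Pandas merging documentation: http://pandas.pydata.org/pandas-docs/stable/merging.html

Inner join merge: by default, "pd.merge()" performs an inner join, so only rows whose species_id appears in both DataFrames are kept.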

In [11]:
# Both frames use the same key column name, so a single on= argument is enough
merged_survey_dataframe = pd.merge(left=big_survey_dataframe, right=species_dataframe, on='species_id')
merged_survey_dataframe

Unnamed: 0,record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight,genus,species,taxa
0,1,7,16,1977,2,NL,M,32.0,,Neotoma,albigula,Rodent
1,2,7,16,1977,3,NL,M,33.0,,Neotoma,albigula,Rodent
2,22,7,17,1977,15,NL,F,31.0,,Neotoma,albigula,Rodent
3,38,7,17,1977,17,NL,M,33.0,,Neotoma,albigula,Rodent
4,72,8,19,1977,2,NL,M,31.0,,Neotoma,albigula,Rodent
5,106,8,20,1977,12,NL,,,,Neotoma,albigula,Rodent
6,107,8,20,1977,18,NL,,,,Neotoma,albigula,Rodent
7,121,8,21,1977,15,NL,,,,Neotoma,albigula,Rodent
8,171,9,11,1977,12,NL,,,,Neotoma,albigula,Rodent
9,194,9,12,1977,11,NL,,,,Neotoma,albigula,Rodent


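Left join merge: passing how='left' to "pd.merge()" keeps every row from the left DataFrame and fills the columns coming from the right DataFrame with NaN wherever no match is found.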
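
A left join could look roughly like the sketch below. The rows are made up for illustration, and "ZZ" is a hypothetical species code chosen because it has no entry in the species table. The indicator=True argument is optional; it adds a "_merge" column that makes unmatched rows easy to find.

```python
import pandas as pd

# Made-up sample rows; "ZZ" is a hypothetical code with no match in the species table
surveys = pd.DataFrame({
    "record_id": [1, 2, 3],
    "species_id": ["NL", "DM", "ZZ"],
})
species = pd.DataFrame({
    "species_id": ["NL", "DM"],
    "genus": ["Neotoma", "Dipodomys"],
})

# how="left" keeps every survey row; unmatched rows get NaN in the species columns.
# indicator=True adds a "_merge" column recording where each row came from.
left_merged = pd.merge(surveys, species, on="species_id", how="left", indicator=True)
print(left_merged)

# Rows present only in the left frame are the ones an inner join would drop
unmatched = left_merged[left_merged["_merge"] == "left_only"]
print(unmatched["species_id"].tolist())
```

Checking the unmatched rows before analysis is a cheap way to spot typos or missing codes in the lookup table.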

## Cleaning Data

* remove duplicate entries

* remove NaN entries

* run function over each row to clean a string (or something else)
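
The cleaning steps listed above might be sketched as follows. The sample rows and the clean_code helper are hypothetical, invented here to give each step something to act on; they are not part of the workshop data.

```python
import numpy as np
import pandas as pd

# Made-up sample rows with a duplicate, a missing value, and untidy strings
df = pd.DataFrame({
    "record_id": [1, 2, 2, 3],
    "species_id": [" nl", "DM ", "DM ", "pf"],
    "weight": [np.nan, 40.0, 40.0, 8.0],
})

# Remove rows that are exact duplicates of an earlier row
deduplicated = df.drop_duplicates()

# Drop rows with no weight; subset= limits the NaN check to that column
with_weight = deduplicated.dropna(subset=["weight"])


def clean_code(value):
    """Strip surrounding whitespace and upper-case a species code (hypothetical helper)."""
    return value.strip().upper()


# Apply the function to every value in the species_id column
cleaned = with_weight.assign(species_id=with_weight["species_id"].apply(clean_code))
print(cleaned)
```

Calling apply on a single column passes one value at a time to the function. To work with whole rows instead, DataFrame.apply with axis=1 passes each row as a Series.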

## Sorting Data

* order data by weight (asc/desc)
* group by sex
* group by species
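
As a rough sketch of the sorting and grouping ideas above, again on made-up rows rather than the real survey data:

```python
import pandas as pd

# Made-up sample rows (illustrative values only)
df = pd.DataFrame({
    "species_id": ["DM", "DM", "PF", "PF"],
    "sex": ["M", "F", "M", "F"],
    "weight": [44.0, 41.0, 8.0, 7.0],
})

# Sort by weight, lightest first; ascending=False reverses the order
lightest_first = df.sort_values("weight")
heaviest_first = df.sort_values("weight", ascending=False)

# Group rows by sex and by species, then average the weight within each group
mean_weight_by_sex = df.groupby("sex")["weight"].mean()
mean_weight_by_species = df.groupby("species_id")["weight"].mean()
print(mean_weight_by_sex)
print(mean_weight_by_species)
```

Note that sort_values returns a new DataFrame and leaves the original order untouched.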

## Transforming Data

* convert m/d/y into datetime
* convert weight to different unit, demonstrating writing a function to transform data
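
One possible sketch of both transformations follows. It assumes the weights are recorded in grams, which the text above does not state, and the grams_to_ounces helper is a hypothetical name introduced for illustration.

```python
import pandas as pd

# Made-up sample rows using the same column names as the survey data
df = pd.DataFrame({
    "month": [7, 8],
    "day": [16, 19],
    "year": [1977, 1977],
    "weight": [40.0, 48.0],
})

# pd.to_datetime can assemble dates from columns named year, month, and day
df["date"] = pd.to_datetime(df[["year", "month", "day"]])


def grams_to_ounces(grams):
    """Convert grams to ounces (1 oz = 28.349523125 g); assumes the input is in grams."""
    return grams / 28.349523125


# Apply the conversion function to every weight
df["weight_oz"] = df["weight"].apply(grams_to_ounces)
print(df[["date", "weight", "weight_oz"]])
```

For a simple arithmetic conversion like this one, dividing the whole column directly (df["weight"] / 28.349523125) gives the same result; apply is shown here because it generalizes to functions that cannot be written as a single expression.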

## Statistics

* datetime stats
* any other advanced stats
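
A few statistics that become available once a datetime column exists, sketched on made-up rows:

```python
import pandas as pd

# Made-up sample rows with an already-parsed date column
df = pd.DataFrame({
    "date": pd.to_datetime(["1977-07-16", "1977-07-17", "1977-08-19"]),
    "weight": [40.0, 44.0, 48.0],
})

# Summary statistics (count, mean, std, quartiles, ...) for a numeric column
print(df["weight"].describe())

# Time span between the earliest and latest record
span = df["date"].max() - df["date"].min()
print(span.days)

# The .dt accessor exposes date parts; here we count records per month
records_per_month = df.groupby(df["date"].dt.month)["weight"].count()
print(records_per_month)
```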

## Visualization

* bar
* box
* scatterplot
* histogram
* applying to subset of data / comparison between different subsets
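
The plot types listed above could be sketched as follows, using the plotting methods built into Pandas (which draw with matplotlib). The rows are made up for illustration; with the real data you would substitute the merged survey DataFrame.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample rows (illustrative values only)
df = pd.DataFrame({
    "species_id": ["DM", "DM", "DM", "PF", "PF", "PF"],
    "hindfoot_length": [37.0, 36.0, 35.0, 14.0, 20.0, 15.0],
    "weight": [44.0, 41.0, 43.0, 8.0, 7.0, 9.0],
})

# One figure with a 2x2 grid of axes, one per plot type
fig, axes = plt.subplots(2, 2, figsize=(8, 6))

# Bar chart: number of records per species
df["species_id"].value_counts().plot(kind="bar", ax=axes[0, 0], title="Records per species")

# Box plot: weight distribution, one box per species (a comparison between subsets)
df.boxplot(column="weight", by="species_id", ax=axes[0, 1])

# Scatter plot: hindfoot length against weight
df.plot(kind="scatter", x="hindfoot_length", y="weight", ax=axes[1, 0])

# Histogram: weights for a single subset of the data
df[df["species_id"] == "DM"]["weight"].plot(kind="hist", ax=axes[1, 1], title="DM weight")

fig.tight_layout()
```

In a notebook the figure is displayed when the cell finishes. In a standalone script, call plt.show() or fig.savefig() afterwards.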