## Series

A one-dimensional array of **indexed** data.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('seaborn-poster')  # Matplotlib >= 3.6 renamed this style to 'seaborn-v0_8-poster'

In [None]:
# Create a numpy array
input_array = np.array([0.2, 0.4, 0.6, 0.8, 1.0])
input_array

In [None]:
# Slice the first three elements
input_array[0:3]

In [None]:
# Make a Series from array
data = pd.Series(input_array)
data

In [None]:
# Slice first three elements
data[0:3]

In [None]:
# Explicit index
data = pd.Series(input_array, index=['a','b','c','d','e'])
data

In [None]:
# We can access an element by an index
data['b']

In [None]:
# We can also slice by the index
data[:'c']
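A label slice includes its end point, whereas a positional slice does not. The following minimal sketch rebuilds the same Series under the hypothetical name `labelled` so that it can run on its own:

```python
import numpy as np
import pandas as pd

# Same values and labels as above; the name `labelled` is only for this sketch
values = np.array([0.2, 0.4, 0.6, 0.8, 1.0])
labelled = pd.Series(values, index=['a', 'b', 'c', 'd', 'e'])

by_label = labelled[:'c']        # label slice: the end label 'c' is included
by_position = labelled.iloc[:2]  # positional slice: position 2 is excluded

print(list(by_label.index))
print(list(by_position.index))
```
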

We can also use descriptive labels as the index:

In [None]:
animal_types = ['cats', 'dogs', 'guinea pigs', 'birds', 'other']
num_rescued = np.random.randint(0, 100, 5)

In [None]:
animals = pd.Series(num_rescued, index=animal_types)
animals

In [None]:
# Select the rows that match the maximum value (probably only one)
animals[animals == animals.max()]
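Because the counts above are random, the sketch below uses fixed, made-up numbers (the Series `counts` is hypothetical) to show what happens when the maximum is tied. The Boolean comparison keeps every tied row, while `Series.idxmax()` returns only the first matching label:

```python
import pandas as pd

# Made-up counts with a deliberate tie between 'dogs' and 'birds'
counts = pd.Series(
    [12, 40, 7, 40, 3],
    index=['cats', 'dogs', 'guinea pigs', 'birds', 'other'],
)

top = counts[counts == counts.max()]  # every row equal to the maximum
first_top = counts.idxmax()           # label of the first maximum only

print(top)
print(first_top)
```
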

In [None]:
# Make a bar plot
animals.plot(kind='bar', title='Rescued animals')

## DataFrames

A `DataFrame` is a collection of `Series` that share the same index.

The DataFrame is the way Pandas represents a table, and Series is the data structure
Pandas uses to represent a column.

Pandas is built on top of the `numpy` library, which in practice means that
most of the methods defined for NumPy arrays apply to Pandas Series/DataFrames.

What makes Pandas so attractive is the powerful interface to access individual records
of the table, proper handling of missing values, and relational-database operations
between DataFrames.
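As a minimal sketch of these ideas, the example below builds a DataFrame from two Series. The names `births`, `deaths` and `table`, and all the numbers, are made up for illustration. The Series are aligned on their labels, labels missing from one Series become `NaN`, and NumPy-style reductions such as `mean()` skip those missing values by default:

```python
import pandas as pd

# Two hypothetical Series with partly different labels
births = pd.Series({'Hobart': 3, 'Launceston': 5})
deaths = pd.Series({'Launceston': 2, 'Longford': 1})

# Each Series becomes a column; rows are aligned on the index labels
table = pd.DataFrame({'births': births, 'deaths': deaths})
print(table)

# Each column is itself a Series
print(type(table['births']))

# Reductions ignore NaN by default
print(table['births'].mean())
```
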

## Selecting values

To access a value at the position `[i,j]` of a DataFrame, we have two options, depending on
what `i` means.
Remember that a DataFrame provides an *index* as a way to identify the rows of the table;
a row, then, has a *position* inside the table as well as a *label*, which
identifies its *entry* in the DataFrame. Labels do not have to be unique: in the data
used below, many rows share the same label.

## Use `DataFrame.iloc[..., ...]` to select values by their (entry) position

*   Can specify location by numerical position, like a 2D version of selecting characters in a string.

In [None]:
data_fn = '../data/tasmania-births-1850-1859.csv.bz2'
data = pd.read_csv(data_fn, index_col='NI_REG_PLACE')

In [None]:
# Show the top records
data.head()

In [None]:
# Display the third record (index position 2 from 0)
print(data.iloc[2])
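The sketch below shows the row, cell and block forms of `iloc` on a small stand-in table. The frame `toy` and its values are invented for illustration; only the column names are borrowed from the dataset above:

```python
import pandas as pd

# Hypothetical stand-in for the births table
toy = pd.DataFrame(
    {
        'NI_REG_YEAR': [1851, 1853, 1858],
        'NI_GENDER': ['Male', 'Female', 'Female'],
    },
    index=pd.Index(['Hobart', 'Launceston', 'Launceston'], name='NI_REG_PLACE'),
)

row = toy.iloc[2]           # third row, returned as a Series
cell = toy.iloc[2, 0]       # third row, first column
block = toy.iloc[0:2, 0:1]  # first two rows, first column (end positions excluded)

print(row)
print(cell)
print(block)
```
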

## Use `DataFrame.loc[..., ...]` to select values by their (entry) label.

*   Can specify location by row name analogously to 2D version of dictionary keys.

In [None]:
# Show the mother column for the first 10 records registered in Launceston
data.loc["Launceston", "NI_MOTHER"].head(10)
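Since index labels may repeat, what `loc` returns depends on how many rows carry the label. A minimal sketch with a hypothetical frame `toy` (invented values, column name borrowed from above):

```python
import pandas as pd

# Hypothetical table in which the label 'Launceston' appears twice
toy = pd.DataFrame(
    {'NI_GENDER': ['Male', 'Female', 'Female']},
    index=pd.Index(['Hobart', 'Launceston', 'Launceston'], name='NI_REG_PLACE'),
)

several = toy.loc['Launceston', 'NI_GENDER']  # repeated label: a Series
single = toy.loc['Hobart', 'NI_GENDER']       # unique label: a single value

print(several)
print(single)
```
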

## Use `:` on its own to mean all columns or all rows.

*   Just like Python's usual slicing notation.

In [None]:
# Get all the columns for a certain registration place
data.loc['Launceston', :].head(2)

*   Would get the same result printing `data.loc["Launceston"]` (without a second index).

In [None]:
# Could also just do
data.loc['Launceston'].head(2)

## Select multiple columns or rows using `DataFrame.loc` and a named slice.

In [None]:
data.loc['Launceston', 'NI_NAME_FACET':'NI_FATHER'].head(3)

In the above code, we discover that **slicing using `loc` is inclusive at both
ends**, which differs from **slicing using `iloc`**, where slicing indicates
everything up to but not including the final index. 
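The difference between the two slicing rules can be checked on a small frame. The frame `toy` and its column names `A` to `D` are hypothetical:

```python
import pandas as pd

# Hypothetical frame with four columns
toy = pd.DataFrame(
    {'A': [1, 2], 'B': [3, 4], 'C': [5, 6], 'D': [7, 8]},
    index=['x', 'y'],
)

by_label = toy.loc[:, 'B':'C']   # label slice: 'C' is included
by_position = toy.iloc[:, 1:2]   # positional slice: position 2 is excluded

print(list(by_label.columns))
print(list(by_position.columns))
```
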


## Use comparisons to select data based on value.

*   Comparison is applied element by element.
*   Returns a similarly-shaped dataframe of `True` and `False`.

In [None]:
# Use a subset of data to keep work clean
town_reg_data = data.loc[:, 'NI_REG_YEAR']
town_reg_data.value_counts()

In [None]:
# Build a Boolean Series that is True where the registration year is after 1860
late_reg_year = town_reg_data > 1860
late_reg_year.head()

## Select values or NaN using a Boolean mask.

*   A frame full of Booleans is sometimes called a *mask* because of how it can be used.

In [None]:
# Show all the data for the late reg years
data.loc[late_reg_year].head()
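A mask can be used in two ways, sketched below on a hypothetical frame `toy` with made-up values. A Boolean Series passed to `loc` drops the rows where it is `False`, while `DataFrame.where` with a frame-shaped mask keeps the shape and puts `NaN` where the mask is `False`:

```python
import pandas as pd

# Hypothetical numeric table
toy = pd.DataFrame({'year': [1851, 1862, 1858], 'age': [30, 25, 41]})

# Row mask (a Boolean Series): rows where it is False are dropped
row_mask = toy['year'] > 1860
late = toy.loc[row_mask]

# Frame mask (a Boolean DataFrame): the shape is kept, False cells become NaN
frame_mask = toy > 1000
masked = toy.where(frame_mask)

print(late)
print(masked)
```
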

## Query for values

In [None]:
# Show the help for `query`, then try to reproduce the mask selection above with it
data.query?
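One possible way to do this is sketched below on a hypothetical frame `toy` (invented values, column names borrowed from above). `DataFrame.query` takes the condition as a string in which column names can be used directly:

```python
import pandas as pd

# Hypothetical stand-in for the births table
toy = pd.DataFrame(
    {
        'NI_REG_YEAR': [1851, 1862, 1858],
        'NI_GENDER': ['Male', 'Female', 'Female'],
    }
)

# Selection with a Boolean mask
via_mask = toy.loc[toy['NI_REG_YEAR'] > 1860]

# The same selection written as a query string
via_query = toy.query('NI_REG_YEAR > 1860')

# Conditions can be combined with `and` / `or`
combined = toy.query('NI_REG_YEAR > 1855 and NI_GENDER == "Female"')

print(via_query)
print(combined)
```
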

## Add one-hot columns using `DataFrame.loc` and a Boolean mask

In [None]:
# Add one-hot columns for male and female
data.loc[data.NI_GENDER == 'Male', 'Male'] = 1
data.loc[data.NI_GENDER == 'Female', 'Female'] = 1

Let's take a look at the values and notice that the entries we did not assign were populated with `NaN`:

In [None]:
data.head(2)

We want numerical values, so we replace the missing entries with `fillna(0)`:

In [None]:
data.fillna(0).head(2)

In [None]:
gender_data = data.loc[:, 'Male':'Female'].fillna(0)
gender_data.head()

In [None]:
# Get the male/female count by registration place (the index, NI_REG_PLACE)
gender_data.groupby('NI_REG_PLACE').sum()
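As an alternative sketch, not the approach taken above, `pd.get_dummies` builds the one-hot columns in a single step, with no `NaN` to fill afterwards. The frame `toy` and its values are made up for illustration:

```python
import pandas as pd

# Hypothetical stand-in for the births table, indexed by registration place
toy = pd.DataFrame(
    {'NI_GENDER': ['Male', 'Female', 'Female', 'Male']},
    index=pd.Index(
        ['Hobart', 'Launceston', 'Launceston', 'Hobart'], name='NI_REG_PLACE'
    ),
)

# One indicator column per distinct value of NI_GENDER
dummies = pd.get_dummies(toy['NI_GENDER'])

# Sum the indicators within each registration place (grouping on the index)
counts = dummies.groupby(level='NI_REG_PLACE').sum()
print(counts)
```
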

## Where to?

[Plotting in Pandas](Plotting.ipynb).

### Questions:
- "How can I do statistical analysis of tabular data?"

### Objectives:
- "Select individual values from a Pandas dataframe."
- "Select entire rows or entire columns from a dataframe."
- "Select a subset of both rows and columns from a dataframe in a single operation."
- "Select a subset of a dataframe by a single Boolean criterion."

### Keypoints:
- "Use `DataFrame.iloc[..., ...]` to select values by integer location."
- "Use `:` on its own to mean all columns or all rows."
- "Select multiple columns or rows using `DataFrame.loc` and a named slice."
- "Result of slicing can be used in further operations."
- "Use comparisons to select data based on value."
- "Select values or NaN using a Boolean mask."

## References

### Software Carpentry
* [DataFrames](http://swcarpentry.github.io/python-novice-gapminder/08-data-frames/)

### Other
* [Introducing Pandas Objects](https://jakevdp.github.io/PythonDataScienceHandbook/03.01-introducing-pandas-objects.html)