![](https://snag.gy/h9Xwf1.jpg)

<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

## Introduction to `pandas` 2

_Authors: Dave Yerrington (SF)_

---

`pandas` is one of the most popular Python packages for working with tabular datasets and is used extensively by data scientists.

### Learning Objectives

- Series axis 1 vs 0
- Understanding Pandas datatypes
- Selection of data
 - Filtering / masking
- Basic Plotting

### Lesson Guide

- [Pandas Indexing](#indexing)
- [Creating DataFrames](#creating_dataframes)
- [Checking Data Types](#dtypes)
- [Renaming and Assignment](#renaming_assignment)
- [Basic `pandas` Plotting](#basic_plotting)
- [Logical Filtering](#filtering)
- [Review](#review)

### There's more to know about Series

There are many operations we can perform on our DataFrames.  Before we step too far into the world of complex transformations, it's important to note the two main aspects of how **series** data can be accessed within a _DataFrame_.

### Axis = 1: Columns

So far, we know we can select one or many columns using brackets: `df[column references here]`.  Each column selected this way is a Series.  We can access the columns axis by _column name_ or by _numeric position_.
![image.png](attachment:image.png)

### Axis = 0:  Rows

There are times we might want to access our data by row.  As we clean and transform data for the various applications we will be using, this is another way of accessing our _DataFrames_ that we will need to be familiar with.  We can access the rows axis by literal index value (even if it's a string), or by numeric position.  More on this in the near future.

![image.png](attachment:image.png)

#### What we can do with axis:
- Select series by row (axis 0), or column (axis 1).
- Use `.map()` on an individual column (a Series), or `.apply()` across the rows or columns of an entire DataFrame.
- We can talk to our friends and colleagues about data in a very specific way.
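
As a rough illustration of how the axis argument shows up in practice, here is a minimal sketch on a made-up toy DataFrame (the names `df`, `x`, and `y` are hypothetical and not part of the lesson data). For reductions such as `.sum()`, the axis names the direction the operation runs along; for `.drop()`, it names which set of labels to look in.

```python
import pandas as pd

# Hypothetical toy data, not the lesson's drug dataset.
df = pd.DataFrame(
    {"x": [1, 2, 3], "y": [10, 20, 30]},
    index=["a", "b", "c"],
)

# axis=0 runs down the rows, giving one result per column.
column_totals = df.sum(axis=0)

# axis=1 runs across the columns, giving one result per row.
row_totals = df.sum(axis=1)

# drop() uses the same numbering: axis=0 looks up a row label,
# axis=1 looks up a column label.
without_row_a = df.drop("a", axis=0)
without_col_y = df.drop("y", axis=1)

print(column_totals)
print(row_totals)
```

Note that `.drop()` returns a new DataFrame here, so `df` itself is left unchanged.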

<a id='indexing'></a>

### Pandas Indexing 

---

More often than not, we want to operate on or extract specific portions of our data. When we index a DataFrame or Series, we specify the section of the data we want to operate on.

Pandas has two main properties that you can use for indexing:

- **`.loc`** indexes with the _labels_ for the rows and columns axes.
- **`.iloc`** indexes with the _integer positions_ for the rows and columns axes.

> Older code may also use **`.ix`**, which indexed with _both labels and integer positions_. It was deprecated in pandas 0.20 and removed in pandas 1.0, so it is mentioned here only for reference.
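
To make the difference concrete, here is a small sketch on a hypothetical toy DataFrame (the column names `x`, `y`, `z` and the row labels are invented for illustration). It selects the same block of data twice, once by label and once by position.

```python
import pandas as pd

# Hypothetical toy data with string row labels.
df = pd.DataFrame(
    {"x": [1, 2, 3, 4], "y": [10, 20, 30, 40], "z": [100, 200, 300, 400]},
    index=["A", "B", "C", "D"],
)

# .loc selects by label. Label slices include BOTH endpoints.
by_label = df.loc["B":"C", ["x", "y"]]

# .iloc selects by integer position. Position slices exclude the stop value.
by_position = df.iloc[1:3, [0, 1]]

# Both selections contain rows B and C of columns x and y.
same_selection = by_label.equals(by_position)
print(same_selection)
```

The slice endpoints are the detail most worth remembering: `"B":"C"` keeps row C, while `1:3` stops before position 3.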

To help clarify these differences, let's first reset the row labels to letters, either with the ```.set_index()``` function or by setting the ```.index``` property explicitly (note that ```.set_index()``` reads a plain list of strings as column names):
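
Both approaches can be sketched on a toy DataFrame before trying them on the drug data. The names `df`, `labels`, `by_property`, and `by_method` below are hypothetical, and wrapping the list in `pd.Index` is just one way to pass new labels to `.set_index()`.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})
labels = ["A", "B", "C"]  # a plain Python list, not a Series

# Option 1: assign to the .index property, which changes that frame in place.
by_property = df.copy()
by_property.index = labels

# Option 2: set_index() returns a new DataFrame. A list of strings would be
# read as column names, so the labels are wrapped in pd.Index first.
by_method = df.set_index(pd.Index(labels))

print(by_property.equals(by_method))
```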

In [7]:
import pandas as pd

# Let's load this drug data again.
drug = pd.read_csv("./datasets/drug-use-by-age.csv")

In [None]:
new_index_values = ['A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q']
# Update our index to this new set of values:  Is this new_index_values a series?

Using the **`.loc`** indexer, we can pull out the rows **B through F** and the columns **marijuana-use and marijuana-frequency**.

In [9]:
# ['B','C','D','E','F'], ["marijuana-use", "marijuana-frequency"]
# subset = drug.loc[???????]


We can do the same thing with the **`.iloc`** indexer, but we have to use integers for the location.

In [None]:
# [1,2,3,4,5], [4,5]
# subset = drug.iloc[?????????]

If you index the rows or columns with integers using **`.loc`** once the labels are strings, you will get an error, because **`.loc`** looks for labels, not positions.
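
A minimal sketch of that failure, using a hypothetical toy DataFrame rather than the drug data: asking `.loc` for the integer `0` when the row labels are letters raises a `KeyError`, while `.iloc` accepts the same integer as a position.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]}, index=["A", "B", "C"])

raised = False
try:
    df.loc[0, "x"]  # 0 is not one of the labels "A", "B", "C"
except KeyError as err:
    raised = True
    print("KeyError:", err)

# The position-based lookup of the same cell works with .iloc.
first_value = df.iloc[0, 0]
print(first_value)
```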

##### How can we reference the variables "age" and "crack-use", but only rows "C" and "F"?

In [None]:
# A:

<a id='creating_dataframes'></a>

### Creating DataFrames

---

The simplest way to create your own DataFrame when not importing from a file is to give the ```pd.DataFrame()``` constructor a dictionary.

In [10]:
mydata = pd.DataFrame({
    'Letters':  ['A','B','C'], 
    'Integers': [1,2,3], 
    'Floats':   [2.2, 3.3, 4.4]
})

In [12]:
# Check it out

As you might expect, the dictionary needs to have lists of values that are all the same length. The keys correspond to the names of the columns and the values correspond to the data in the columns.
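
The same-length requirement can be checked with a short sketch. The dictionaries below are made up for illustration, and `mismatch_error` is a hypothetical name used only to hold the exception.

```python
import pandas as pd

# Keys become column names; each list becomes that column's values.
ok = pd.DataFrame({"letters": ["A", "B", "C"], "numbers": [1, 2, 3]})
print(ok.shape)

# Lists of different lengths cannot be lined up into rows.
mismatch_error = None
try:
    pd.DataFrame({"letters": ["A", "B", "C"], "numbers": [1, 2]})
except ValueError as err:
    mismatch_error = err
    print("ValueError:", err)
```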

<a id='dtypes'></a>

### Examining data types

---

Pandas comes with a useful property to look at just the data types of your DataFrame columns. Use ```.dtypes``` on your DataFrame:

In [13]:
# A:

This will show you what data type each column is. Strings are stored as a type called "object" because they are not guaranteed to take up a set amount of space (strings can be of any length).

#### Can you think of any reasons why you might want to check your dtypes?
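
One possible reason, sketched on hypothetical data: numbers that were read in as text look fine when printed but are not treated as numeric until they are converted. The column names `count` and `price` are invented for this example.

```python
import pandas as pd

# Hypothetical frame: "price" was entered as text, which often happens with messy files.
df = pd.DataFrame({"count": [1, 2, 3], "price": ["1.5", "2.0", "3.25"]})
print(df.dtypes)

# The text column is not recognised as numeric.
numeric_before = pd.api.types.is_numeric_dtype(df["price"])

# Converting it to a numeric dtype makes arithmetic meaningful.
df["price"] = pd.to_numeric(df["price"])
numeric_after = pd.api.types.is_numeric_dtype(df["price"])
price_total = df["price"].sum()

print(numeric_before, numeric_after, price_total)
```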

<a id='renaming_assignment'></a>

### Renaming and Assignment

---

Pandas makes it easy to change column names and assign values to your DataFrame.

Say we wanted to change the column name "Integers" to "int":

In [14]:
## Check out columns property

'Integers'

In [15]:
# Use rename function -- reference inline documentation 
# inplace = True updates mydata
print(mydata.columns)

Index(['Floats', 'Integers', 'Letters'], dtype='object')


In [None]:
# Display DataFrame post-rename operation

If you wanted to change every column name, you could just assign a new list to the ```.columns``` property.
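
Both renaming approaches can be sketched on a throwaway DataFrame. The frame below mirrors the shape of `mydata` but is a separate, hypothetical object, so running it does not affect the exercise cells.

```python
import pandas as pd

df = pd.DataFrame({"Letters": ["A", "B"], "Integers": [1, 2], "Floats": [2.2, 3.3]})

# rename() maps old names to new ones and returns a new DataFrame;
# columns not mentioned in the mapping keep their names.
renamed = df.rename(columns={"Integers": "int"})
after_rename = list(renamed.columns)
print(after_rename)

# Assigning to .columns replaces every name at once, in column order.
renamed.columns = ["A", "B", "C"]
print(list(renamed.columns))
```

Without `inplace=True`, `rename()` leaves the original `df` untouched, which is why the result is stored in a new variable here.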

In [None]:
# Rename columns via property to ['A','B','C']
mydata.head()

In [22]:
# Select all rows with iloc[:], restricted to columns 0:3, i.e. iloc[:, 0:3]

In [23]:
# Selecting series with column reference for features 0:3

We can assign values using the indexing that we learned before.
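
As a small sketch of the pattern, on a hypothetical toy DataFrame rather than `mydata`: put the `.loc` selection on the left of the `=` and the new value or values on the right.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})

# Single cell: row label 1, column "y".
df.loc[1, "y"] = 100

# Whole column: the list must have one value per row.
df.loc[:, "x"] = [0, 0, 0]

print(df)
```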

Let's change the newly renamed "B" column at row index 1 to be 100.

In [None]:
# .loc[?, ??]

Alternatively we can assign multiple values at once with lists.

In [None]:
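# mydata.loc[:, 'A'] = [0,0,0]
# print(mydata)

# Note: the string below assumes column 'C' holds text. If 'C' is numeric in
# your session, recent pandas versions warn about or reject the mixed dtype.
mydata.loc[0, ['B','C']] = [-1000, 'newstring']
print(mydata.head())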

<a id='basic_plotting'></a>

### Basic plotting using DataFrames

---

DataFrames also come with some basic convenience functions for plotting data. First import matplotlib and set it to run "inline" in your notebook.

In [24]:
import matplotlib.pyplot as plt

%matplotlib inline

Using our ```drug``` DataFrame again, use the ```.plot()``` function to plot the **age** column against the **marijuana-use** column.

In [26]:
# plot x as age, y as marijuana-use - title="Drug use by age"

The ```.hist()``` function will create a histogram for a column's values.
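
Both plotting calls can be sketched on made-up data before applying them to `drug`. The DataFrame and its column names `x` and `y` are hypothetical. Because this sketch runs as a single block, it opens a new figure before the histogram so the two plots do not share axes; in a notebook with one plot per cell that step is usually not needed.

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [2, 4, 3, 8, 6]})

# Line plot of one column against another; pandas returns a Matplotlib Axes.
line_ax = df.plot(x="x", y="y", title="y by x")

# Start a new figure so the histogram is not drawn on the line plot's axes.
plt.figure()

# Histogram of a single column's values, split into 5 bins.
hist_ax = df["y"].hist(bins=5)
```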

In [27]:
# Plot histogram of feature / variable / column: "marijuana-use"

### Pandas plotting features

It's very handy to be able to draw multiple sub-plots within a single figure.  Since Pandas uses Matplotlib under the hood, it's very useful to combine these tools to get the most out of your plots.

In [30]:
# import matplotlib, setup figure with 1 row and 2 columns - blank

We can access each individual sub-plot through the `ax` array we defined earlier, using `ax[index]`.
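
A minimal sketch of this pattern, assuming a made-up DataFrame and a figure with one row and two columns: `plt.subplots` returns the figure and an array of axes, and passing `ax=` to a pandas plotting call tells it which sub-plot to draw on.

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [2, 4, 3, 8, 6]})

# One row, two columns: ax is a 1-D array holding two Axes.
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))

# Pass ax= so pandas draws on a specific sub-plot instead of a new figure.
df.plot(x="x", y="y", ax=ax[0], title="Line plot")
df["y"].hist(ax=ax[1])

# Matplotlib methods on the same Axes add text features such as titles.
ax[1].set_title("Histogram")
```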

In [32]:
# Same plot as before, but with text features

### Adding our Pandas plots using `ax`

In [34]:
## Plot both line plot and histogram in one figure, from Pandas to Matplotlib

### More than one row

`ax[row, column]`
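
With more than one row and more than one column, `ax` becomes a two-dimensional array. The sketch below uses a hypothetical DataFrame and a 2x3 grid, and fills only two of the six sub-plots to show the indexing.

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 3, 8], "z": [5, 1, 4, 2]})

# Two rows, three columns: ax is a 2-D array indexed as ax[row, column].
fig, ax = plt.subplots(nrows=2, ncols=3, figsize=(12, 6))

df.plot(x="x", y="y", ax=ax[0, 0], title="row 0, column 0")
df.plot(x="x", y="z", ax=ax[1, 2], title="row 1, column 2")

# Reduce overlap between neighbouring sub-plots.
fig.tight_layout()
```

When the grid has only a single row or a single column, Matplotlib returns a one-dimensional array by default, so `ax[index]` is enough in that case.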

In [36]:
## Demonstrate multiple ax row / column reference

In [39]:
## 3 rows of figures

In [41]:
## 6 plots -- 2x3

<a id='filtering'></a>

### Filtering Logic

---

One of the most powerful features of DataFrames is using logical (boolean) conditions to filter data.

Subset the ```drug``` data for only the rows where marijuana-use is greater than 20.

In [None]:
# A:

The ampersand (`&`) can be used to subset where multiple conditions must all be met for each row. Wrap each condition in its own parentheses.

Subset the data for marijuana use over 20, as before, but now also where `n` is greater than 4000.
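
The mechanics can be sketched on a toy DataFrame first. The columns `score` and `n` and their values below are invented for illustration and are not taken from the drug data.

```python
import pandas as pd

df = pd.DataFrame({"score": [5, 25, 30, 15], "n": [1000, 5000, 3000, 6000]})

# A comparison on a column returns a boolean Series (the mask).
mask = df["score"] > 20

# Indexing with the mask keeps only the rows where it is True.
high_score = df[mask]

# Combine conditions with & (and) or | (or). Each condition needs its own
# parentheses because & binds more tightly than > in Python.
both = df[(df["score"] > 20) & (df["n"] > 4000)]

print(high_score)
print(both)
```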

In [None]:
# A:

In [None]:
# Sorting Demo - quick

#### Time Permitting:  Map + Apply Demo
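
A possible starting point for this demo, sketched on a hypothetical toy DataFrame: `.map()` works on one Series at a time, while `.apply()` on a DataFrame hands the function a whole row or a whole column depending on `axis`.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})

# Series.map transforms one column, element by element.
x_squared = df["x"].map(lambda value: value ** 2)

# DataFrame.apply with axis=1 passes each row to the function as a Series.
row_sum = df.apply(lambda row: row["x"] + row["y"], axis=1)

# DataFrame.apply with axis=0 (the default) passes each column instead.
column_max = df.apply(lambda column: column.max(), axis=0)

print(x_squared.tolist(), row_sum.tolist(), column_max.to_dict())
```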

<a id='review'></a>

### Review

---

 - What is axis 1 vs 0 and how can we use them?
 - How do we slice? Index? Filter?
 - Why might we inspect our datatypes?
 - How do we use Pandas plots with Matplotlib to create multiple sub-figures in a single figure?