### Intro to Pandas

<img src="https://wow.gamepedia.com/media/wow.gamepedia.com/6/6c/Twopandaren.jpg" alt="Drawing" style="width: 300px; float: right; padding: 40px"/>

What is Pandas?

From the homepage...
> pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

And from the docs...
> pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal.

It's worth at least reading through the overview of features [here](http://pandas-docs.github.io/pandas-docs-travis/), and trying out their '10 minutes to Pandas' feature [here](http://pandas-docs.github.io/pandas-docs-travis/10min.html). There is also a great video series available [here](https://www.safaribooksonline.com/library/view/introduction-to-pandas/9781771375764/) for more details.

For our purposes, pandas is a library that makes working with large arrays of data easy, much like a spreadsheet tool. We can import Excel files, CSVs, etc., and look at the data in terms of named rows and columns. If we only used NumPy, we would be looking at nameless arrays, and it would be harder to wrap our minds around the manipulations we are performing on the data.
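
To make the contrast with NumPy concrete, here is a minimal sketch. The measurements and the column names `length` and `width` are made up for illustration and do not come from any dataset.

```python
import numpy as np
import pandas as pd

# hypothetical measurements: two rows, two columns
raw = np.array([[5.1, 3.5],
                [4.9, 3.0]])

# with NumPy alone, we have to remember that column 1 holds the width
widths_np = raw[:, 1]

# with pandas, the columns carry names we can refer to directly
df = pd.DataFrame(raw, columns=['length', 'width'])
widths_pd = df['width']

print(widths_np)
print(widths_pd)
```

Both lookups return the same numbers, but the pandas version documents itself.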

A huge part of machine learning (some would say the biggest) is the data-cleaning and feature-engineering phase. This is where we take our raw input data and manipulate it into something that will act as good training data for ML algorithms. Pandas is extremely useful for both tasks.


---

### Iris Dataset Example
<img src="http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2015/04/iris_petal_sepal.png" alt="Drawing" style="width: 200px; float: right; padding: 10px"/>

One of the most famous datasets in ML is the 'Iris Dataset'. Created in the 1930s to help research some pre-computer classification techniques, it samples 150 flowers from 3 separate species, each with 4 measurements, along with the actual species of that flower.

The ML goal when looking at this set is to develop a model that can predict what species of flower a sample is when presented with those 4 measurements.

Let's look at the iris dataset, and see how we can use Pandas to do some basic manipulation and analysis on it.

Execute the following cells:

In [None]:
# usually we would load data from a CSV or other data file, but scikit-learn comes with some popular datasets built in.
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()
print(iris.DESCR)

In [None]:
# let's see what is in the iris object Scikit gave us:
iris.keys()

In [None]:
# what are the features?
iris.feature_names

In [None]:
# what are the targets?
iris.target_names

In [None]:
# what does the target data look like?
iris.target

In [None]:
# What does the data look like? (let's just look at first 5 rows)
iris.data[:5]

It can be a bit cumbersome dealing with raw arrays, and numbers instead of labels. Pandas gives us a much nicer API to explore the data with, but first we need to build the core Pandas object, the **DataFrame**.

DataFrames can be built in many different ways, including loading CSVs, Excel files, connecting to Databases, etc. Here, we're going to build one using the Arrays coming from the iris data.

In [None]:
# it's this easy!
# note how it has auto-assigned an index (far-left column).
# these indices become a powerful feature of Pandas as we dig into it.

iris_features_df = pd.DataFrame(iris.data)
iris_features_df.columns = iris.feature_names
iris_features_df
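
As a rough sketch of two of the other construction routes mentioned above, the cell below builds the same small DataFrame from a dict of columns and from CSV text. The values are made up, and `io.StringIO` stands in for a real file on disk.

```python
import io
import pandas as pd

# from a dict mapping column names to lists of values (made-up numbers)
from_dict_df = pd.DataFrame({
    'sepal length (cm)': [5.1, 4.9],
    'sepal width (cm)': [3.5, 3.0],
})

# from CSV text; pd.read_csv accepts a file path or a file-like object
csv_text = "sepal length (cm),sepal width (cm)\n5.1,3.5\n4.9,3.0\n"
from_csv_df = pd.read_csv(io.StringIO(csv_text))

# both routes should produce the same table
print(from_dict_df.equals(from_csv_df))
```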

In [None]:
# let's create another DataFrame to hold the labels (the target data).
# use the head method to see just the first few rows
iris_targets_df = pd.DataFrame(iris.target)
iris_targets_df.columns = ['label']
iris_targets_df.head()

In [None]:
# DataFrames can be combined using joins, much like in SQL.
# Let's join the feature and target DFs based on their index values:

iris_df = pd.merge(iris_features_df, iris_targets_df, left_index=True, right_index=True)
iris_df.head()
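
An index-based merge is only one way to combine DataFrames side by side. The sketch below uses small stand-in DataFrames with made-up values (`features_df` and `targets_df` are illustrative names) to compare `pd.merge` with `DataFrame.join` and `pd.concat`, which should give the same result when both frames share an index.

```python
import pandas as pd

# small stand-ins for the feature and target DataFrames (made-up values)
features_df = pd.DataFrame({'sepal width (cm)': [3.5, 3.0, 3.2]})
targets_df = pd.DataFrame({'label': [0, 0, 1]})

# explicit index-on-index merge
merged_df = pd.merge(features_df, targets_df, left_index=True, right_index=True)

# join uses the index by default
joined_df = features_df.join(targets_df)

# concat along axis=1 places the columns side by side, aligned on the index
concat_df = pd.concat([features_df, targets_df], axis=1)

print(merged_df)
print(joined_df)
print(concat_df)
```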

In [None]:
# it's starting to look good, but it's tough to remember what 0, 1, and 2 represent in the label column.
# we can easily map those values to more meaningful terms.
label_map = {
    0: 'setosa',
    1: 'versicolor',
    2: 'virginica'
}
iris_df = iris_df.replace({'label': label_map})
iris_df.head()
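
`Series.map` is another way to apply a mapping like this to a single column. The sketch below uses a short made-up Series of label codes and shows one difference worth knowing: with `map`, any value missing from the dict becomes `NaN` rather than being left unchanged.

```python
import pandas as pd

label_map = {0: 'setosa', 1: 'versicolor', 2: 'virginica'}

# made-up label codes standing in for the 'label' column
codes = pd.Series([0, 1, 2, 1], name='label')

# every code has an entry in the dict, so every value is translated
names = codes.map(label_map)

# with an incomplete dict, unmapped codes turn into NaN
partial = codes.map({0: 'setosa'})

print(names)
print(partial)
```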

In [None]:
# pandas knows the types of the different data stored inside it
iris_df.dtypes

In [None]:
# pandas can give us quick summaries of the data contained within a DF
iris_df.describe(include='all')

In [None]:
# for non-numeric, or 'categorical' values, we can see how they break down by frequency
iris_df['label'].value_counts()

In [None]:
# what if we want to dig in on 'versicolor'?
# we can create a 'mask', which is a boolean array indicating whether something is true or not for a given row.
versicolor_mask = iris_df['label'] == 'versicolor'
versicolor_mask

In [None]:
# we can then use that mask against the dataframe to view the filtered data
iris_df[versicolor_mask]

In [None]:
# and then perform analysis on that filtered data
iris_df[versicolor_mask].describe()

In [None]:
# the mean sepal width for versicolor was 2.77 cm, but overall it was 3.05 cm
# maybe we can find versicolors based on this alone?
small_width_mask = iris_df['sepal width (cm)'] < 2.5
iris_df[small_width_mask]['label'].value_counts()

# try changing the threshold and see how well it manages to find them
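
One way to run that experiment is to loop over several thresholds and collect the label counts for each. The sketch below uses a handful of made-up rows standing in for `iris_df`, and the threshold values are arbitrary.

```python
import pandas as pd

# a handful of made-up rows standing in for iris_df
df = pd.DataFrame({
    'sepal width (cm)': [2.3, 2.8, 3.0, 3.4, 2.5, 3.8],
    'label': ['versicolor', 'versicolor', 'virginica',
              'setosa', 'virginica', 'setosa'],
})

results = {}
for threshold in (2.4, 2.9, 3.5):
    # keep only the rows below the current threshold
    below = df[df['sepal width (cm)'] < threshold]
    # count how many rows of each label survived the filter
    results[threshold] = below['label'].value_counts().to_dict()
    print(threshold, results[threshold])
```

A low threshold isolates versicolor but misses some samples, while a high one catches more of them along with other species.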

In [None]:
# we can also use more complex boolean queries

complex_mask = (iris_df['sepal width (cm)'] < 3) \
    & (iris_df['sepal length (cm)'] < 6 )
    
maybe_versicolor_df = iris_df[complex_mask]
maybe_versicolor_df['label'].value_counts()
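
Masks can also be combined with `|` (or) and inverted with `~` (not). The sketch below uses a few made-up rows to illustrate. Note that each comparison needs its own parentheses when written inline, because `&` and `|` bind more tightly than `<` in Python.

```python
import pandas as pd

# made-up rows standing in for iris_df
df = pd.DataFrame({
    'sepal width (cm)': [2.3, 2.8, 3.0, 3.4],
    'sepal length (cm)': [5.0, 6.4, 5.9, 5.1],
    'label': ['versicolor', 'versicolor', 'virginica', 'setosa'],
})

narrow = df['sepal width (cm)'] < 3
short = df['sepal length (cm)'] < 6

either_df = df[narrow | short]                 # OR: at least one condition holds
not_narrow_df = df[~narrow]                    # NOT: invert the mask
both_labels = df.loc[narrow & short, 'label']  # AND, keeping only one column

print(either_df)
print(not_narrow_df)
print(both_labels)
```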

The [Decision Tree](http://scikit-learn.org/stable/modules/tree.html) algorithm would automatically build a more sophisticated version of the rule above to classify versicolor, and all of the other labels, and it does a pretty good job!
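
As a rough illustration, the cell below fits a shallow scikit-learn decision tree on the iris data and prints the rules it learned. The `max_depth=2` and `random_state=0` settings are arbitrary choices made here to keep the output short and repeatable. The score is measured on the training data itself, so it is an optimistic estimate.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# a shallow tree keeps the learned rules readable
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# print the threshold rules the tree chose, using the feature names as labels
print(export_text(tree, feature_names=list(iris.feature_names)))

# fraction of training samples classified correctly
train_accuracy = tree.score(iris.data, iris.target)
print(train_accuracy)
```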

Pandas also makes it easy to output your DataFrames to CSVs, for analysis in Excel/etc. or for sharing with colleagues. Try running the following cell and see if you can open the resultant spreadsheet.

In [None]:
# output to CSV for further analysis

import os
os.makedirs('./tmp', exist_ok=True)
maybe_versicolor_df.to_csv('./tmp/maybe_versi.csv')
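
To check what was written, the CSV can be read back in. The sketch below does a round trip through an in-memory buffer (`io.StringIO`) instead of a file, using a small made-up DataFrame. By default `to_csv` writes the index as the first column, so `index_col=0` restores it when reading.

```python
import io
import pandas as pd

df = pd.DataFrame(
    {'sepal width (cm)': [2.3, 2.8], 'label': ['versicolor', 'versicolor']},
    index=[53, 54],  # filtered DataFrames keep their original row labels
)

buffer = io.StringIO()
df.to_csv(buffer)   # the index is written as the first, unnamed column
buffer.seek(0)      # rewind so the buffer can be read from the start

restored_df = pd.read_csv(buffer, index_col=0)
print(restored_df)
```

If the row labels are not needed in the spreadsheet, passing `index=False` to `to_csv` leaves them out.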

--- 

This barely scratches the surface of Pandas' capabilities. Some other important features include handling of missing data, tools for time-series data, and some sophisticated grouping features for things like aggregation and filtering. It's worth browsing the [Pandas Docs](https://pandas.pydata.org/pandas-docs/stable/#) to get a sense of all that is possible.
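
As a small taste of two of those features, the sketch below counts and fills a missing value, then computes a per-label mean with `groupby`. The data is made up, and filling with the column mean is just one simple strategy among many.

```python
import numpy as np
import pandas as pd

# made-up data with one missing measurement
df = pd.DataFrame({
    'label': ['setosa', 'setosa', 'versicolor', 'versicolor'],
    'sepal width (cm)': [3.5, np.nan, 2.8, 2.6],
})

# how many values are missing?
missing_count = df['sepal width (cm)'].isna().sum()

# fill the gap with the mean of the values that are present
filled = df['sepal width (cm)'].fillna(df['sepal width (cm)'].mean())

# mean per label; missing values are skipped by default
mean_by_label = df.groupby('label')['sepal width (cm)'].mean()

print(missing_count)
print(filled)
print(mean_by_label)
```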

It's also worth noting that while this dataset only had 150 rows, Pandas is focused on speed, and can handle datasets with millions of rows very quickly. If you ever have trouble with a spreadsheet in Excel, it might be worth trying to load it in Pandas and do your queries there to save some time!

#### Congratulations!

You are now familiar with Jupyter, Python, NumPy, and Pandas. You're ready to start digging into some real ML tools now.

To be continued...