![](https://snag.gy/h9Xwf1.jpg)

<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

## Introduction to `pandas` 1

_Authors: Dave Yerrington (SF)_

---

`pandas` is the most popular Python package for managing datasets and is used extensively by data scientists.

### Learning Objectives

- Explain the difference between a Series and a DataFrame.
- Describe basic characteristics of DataFrames.
- Practice plotting with pandas.

### Lesson Guide

- [Introduction to `pandas`](#introduction)
- [Loading csv Files](#loading_csvs)
- [Exploring Your Data](#exploring_data)
- [Data Dimensions](#data_dimensions)
- [DataFrames vs. Series](#dataframe_series)
- [Using the `.info()` Function](#info)
- [Using the `.describe()` Function](#describe)
- [Independent Practice](#independent_practice)

### A note on different Pandas versions.

It's important to realize that there are some differences between Pandas versions. If you are using a version of the Pandas library that this lesson was tested with, this notebook should work as written. With other versions you may run into minor differences in behavior, but you should be able to adapt to them.

Currently, this notebook is tested in v0.19.2 and v0.22.0 of Pandas.

> The cell below can be run to see which version you are using currently.

In [2]:
import pandas as pd

pd.__version__

'0.22.0'

<a id='introduction'></a>

### What is `pandas`?

---

- Data analysis library. The name derives from "**pan**el **da**ta".
- Created by Wes McKinney and open sourced by AQR Capital Management, LLC in 2009.
- Implemented in highly optimized Python/Cython.
- Most ubiquitous tool used to start data analysis projects within the Python scientific ecosystem.


### Pandas Use Cases

---

- Cleaning data / Munging
- Exploratory Analysis
- Structuring data for plots or tabular display
- Joining disparate sources
- Modeling
- Filtering, extracting, or transforming 


### Discussion:  What do you think are some challenges when accessing data?

Follow up:  Do you really feel it's right that data scientists typically have to clean so much data?

![](https://snag.gy/tpiLCH.jpg)

![](https://snag.gy/1V0Ol4.jpg)

### Common Outputs

---

- Export to Databases
- Integrated with `matplotlib`
- Collaborate in common formats (plus a variety of others)
- Integration with Python built-ins (**and `numpy`!**)


### Importing `pandas`

---

Import pandas at the top of your notebook like so:

In [3]:
import pandas as pd

Recall that the **`import pandas as pd`** syntax nicknames the `pandas` module as **`pd`** for convenience.

<a id='loading_csvs'></a>

### Loading a csv into a DataFrame

---

Pandas can load many types of files, but one of the most common file types for storing data is ```.csv```. Let's load a dataset on drug use by age from the ```./datasets``` directory:

In [4]:
drug = pd.read_csv('./datasets/drug-use-by-age.csv')

This creates a pandas object called a **DataFrame**. These are powerful containers for data with many built-in functions to explore and manipulate data.

We will barely scratch the surface of DataFrame functionality in this lesson, but over the course of this class you will become an expert at using them.
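If you don't have the file handy, here is a minimal sketch of what `pd.read_csv` does. The rows are made up for illustration and are not the real dataset, and `io.StringIO` simply stands in for a file on disk.

```python
import io

import pandas as pd

# Made-up rows standing in for a small CSV file (not the real dataset).
csv_text = """age,n,alcohol-use,crack-use
12,100,1.5,0.0
13,200,2.5,0.1
14,300,3.5,0.2
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(type(sample))  # the result is a DataFrame
print(sample)
```

The first line of the text becomes the column names, and each remaining line becomes one row.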

<a id='exploring_data'></a>

### Exploring data using DataFrames

---

DataFrames come with built-in functionality that makes data exploration easy. 

Let's start by looking at the "header" of your data with the built-in ```.head()``` function. If run alone in a notebook cell, it will show you the first 5 rows. When there are many columns, the display shows only the first and last handful of them.

In [None]:
# inspect the "head"

If we want to see the last part of our data, we can similarly use the ```.tail()``` function.

In [6]:
# inspect the "tail"
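As a quick sketch of both functions, here they are on a small made-up DataFrame (the name `toy` and its values are hypothetical, chosen only for illustration). Both take an optional number of rows.

```python
import pandas as pd

# Hypothetical ten-row frame, used only for illustration.
toy = pd.DataFrame({"row_id": range(10), "value": [x * 0.5 for x in range(10)]})

first_rows = toy.head()   # first 5 rows by default
last_rows = toy.tail(3)   # last 3 rows

print(first_rows)
print(last_rows)
```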

<a id='data_dimensions'></a>

### Data dimensions

---

It's good to look at what the dimensions of your data are. The ```.shape``` property will tell you the row and column counts of your DataFrame.

> _Protip on dimensions_
>
> _The scale of a data problem can be determined by the dimensions of a dataset and inspecting how much space it occupies in memory (we will look at this later)._

In [7]:
# inspect "shape"

You can see we have 17 rows and 28 columns. This is obviously a small dataset.

You will notice that this operates the same as `.shape` for numpy arrays/matrices. Pandas makes use of numpy under the hood for optimization and speed.
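A small sketch of the comparison with numpy, using a made-up DataFrame (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical frame with 3 rows and 2 columns.
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# .shape is a property, so there are no parentheses.
n_rows, n_cols = toy.shape
print(n_rows, n_cols)

# The underlying numpy array reports the same dimensions.
print(toy.values.shape)
```

The result is a `(rows, columns)` tuple, so it can be unpacked into two variables as above.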

Look at the names of your columns with the ```.columns``` property.

[Note: In Python 2 you may see the column names displayed as **u'string'**. The `u` prefix marks a unicode string, and most of the time you can safely ignore it.]

In [8]:
# Inspect columns

Accessing a specific column is easy. You can use the bracket syntax, just like with Python dictionaries, with the string name of the column to extract that column.

In [9]:
# Inspect head of a series called "crack-use"

As you can see we can also use the ```.head()``` function on a single column, which is represented as a pandas Series object.

You can also access a column (as a DataFrame instead of a Series) or multiple columns with a list of strings.

In [11]:
# Inspect feature (aka: column) called "crack-use"

In [12]:
# Inspect features "age", "crack-use" with head

In [10]:
# Inspect unique values of "age"
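The column operations above can be sketched on a small made-up DataFrame. The column names mirror the ones used in this lesson, but the values are invented for illustration.

```python
import pandas as pd

# Hypothetical data; only the column names echo the lesson's dataset.
toy = pd.DataFrame({
    "age": ["12", "13", "13", "14"],
    "crack-use": [0.0, 0.1, 0.1, 0.5],
    "n": [10, 20, 20, 30],
})

print(toy.columns)                     # the column labels

one_column = toy["crack-use"]          # a string selects one column
sub_frame = toy[["age", "crack-use"]]  # a list of strings selects several columns
ages = toy["age"].unique()             # distinct values, in order of appearance

print(one_column.head())
print(sub_frame.head())
print(ages)
```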

<a id='dataframe_series'></a>

### DataFrame vs. Series

---

There is an important difference between using a list of strings and just a string with a column's name: when you use a list, even one containing a single string, it returns another **DataFrame**, but when you use just the string it returns a pandas **Series** object.

In [12]:
# Let's check this out, Series vs DataFrame
# drug[['age']]

In [13]:
print(type(drug['age']))

print(type(drug[['age']]))

<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>


What is the difference between a pandas Series and DataFrame object?

Essentially, a **Series** object contains the data for a single column of your data, and the **DataFrame** is a matrix-like container for those Series objects that comprise your data.

As long as your column names have no spaces or other special characters in them (underscores are OK), you can access a column as a property of the DataFrame.

**Get in the habit of referencing your Series columns using `df['my_column']` rather than the object notation `df.my_column`**. There are many edge cases where the object notation does not work, and there are nuances in how Pandas behaves with **`df.my_column`** vs. **`df['my_column']`**.

> When in doubt, use the method that behaves most consistently for accessing columns as you learn to use DataFrames.

In [13]:
# Head of column bracket reference vs class attribute for "age"

Remember: this will be a **Series** object, not a DataFrame.
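To illustrate a couple of the edge cases, here is a sketch on a made-up DataFrame whose column names were picked (hypothetically) to cause trouble: one matches a DataFrame method and one contains a hyphen.

```python
import pandas as pd

# Hypothetical frame with deliberately awkward column names.
toy = pd.DataFrame({"age": [12, 13], "count": [5, 7], "crack-use": [0.0, 0.1]})

# For a simple name, both forms return the same Series.
same = toy.age.equals(toy["age"])
print(same)

# "count" is also a DataFrame method, so the attribute form returns the method,
# while the bracket form returns the column.
print(callable(toy.count))
print(type(toy["count"]))

# toy.crack-use would be parsed as `toy.crack - use`, so brackets are required.
print(toy["crack-use"].head())
```

The bracket form behaves the same way in every one of these cases, which is why it is the safer habit.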

#### How many series do you think exist in the DataFrame "drug"?

In [15]:
## Inspect the DataFrame called drug on your own

<a id='info'></a>

### Examining your data with `.info()`

---

The output of `.info()` should be the first thing you look at when getting acquainted with a new dataset.

**Types** are very important.  They impact the way data will be represented in our machine learning models, how data can be joined, whether or not math operators can be applied, and when you can encounter unexpected results.

> _Typical problems when working with new datasets_:
> - Missing values
> - Unexpected types (string/object instead of int/float)
> - Dirty data (commas, dollar signs, unexpected characters, etc)
> - Blank values that are actually "non-null" or single white-space characters

`.info()` is a function that is available on every **DataFrame** object. It gives you information about:

- Name of column / variable attribute
- Type of index (RangeIndex is default)
- Count of non-null values by column / attribute
- Type of data contained in column / attribute
- Counts of each dtype (Pandas data type)
- Memory usage of our dataset


In [16]:
# Inspect the "drug" DataFrame with its `.info()` function.
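Here is a sketch of how `.info()` surfaces some of the typical problems listed above. The data is made up: the hypothetical `price` column holds text because of the dollar signs, and `rating` has a missing value.

```python
import io

import pandas as pd

# Hypothetical messy data, invented for illustration.
toy = pd.DataFrame({
    "price": ["$1.50", "$2.00", "$3.25"],  # text, not numbers
    "rating": [4.0, None, 5.0],            # one missing value
    "units": [1, 2, 3],
})

# .info() prints its report; passing buf captures the text instead.
buffer = io.StringIO()
toy.info(buf=buffer)
report = buffer.getvalue()
print(report)

# Non-null counts per column, as a Series.
non_null = toy.count()
print(non_null)
```

In the report, `price` is listed with a text dtype rather than a numeric one, and `rating` shows fewer non-null values than there are rows.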

### Caveat:  Working with Larger Datasets 

---

If you have a dataset that is larger than your available memory, there are better solutions for working with your data.

![](https://snag.gy/UGNamo.jpg)

Generally:

- Consider storing your data in a relational database.
- Use HDF5 (PyTables) if you need to operate on all of the data.
- Take a sample of your larger dataset, approximating the total data, before importing or downloading.
- Consider a distributed computing environment like Hadoop, StarCluster, or Spark (there are even more options and considerations here, but we will cover them in the future!).
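Pandas itself also offers a couple of lighter-weight options that are not in the list above. This is only a sketch under assumptions: the CSV text is generated in memory to stand in for a large file, and the chunk size of 25 is arbitrary.

```python
import io

import pandas as pd

# Hypothetical CSV with 100 rows, standing in for a file too large for memory.
csv_text = "value\n" + "\n".join(str(i) for i in range(100))

total = 0
n_rows = 0
# With chunksize, read_csv returns an iterator of DataFrames (up to 25 rows each)
# instead of loading everything at once.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=25):
    total += chunk["value"].sum()
    n_rows += len(chunk)

mean_value = total / n_rows
print(n_rows, mean_value)

# nrows reads only the first rows: a quick, non-random peek at the data.
preview = pd.read_csv(io.StringIO(csv_text), nrows=10)
print(preview.shape)
```

Note that `nrows` takes the first rows of the file, so it is a preview rather than a representative sample.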


<a id='describe'></a>

### Summarizing data with `.describe()`

---

The ```.describe()``` function is very useful for taking a quick look at your data. It gives you some of the basic descriptive statistics.

Use the ```.describe()``` function on just the ```crack-use``` column.

In [16]:
## Describe "crack-use" of "drug" DataFrame

count    17.000000
mean      0.294118
std       0.235772
min       0.000000
25%       0.000000
50%       0.400000
75%       0.500000
max       0.600000
Name: crack-use, dtype: float64

You can use it on multiple columns, such as ```crack-use``` and ```alcohol-frequency```.

In [17]:
# Might also describe with transposed output

```.describe()``` gives us these statistics:

- **count**, the number of non-null values in the column (equal to the number of rows when nothing is missing)
- **mean**, the average of the values in the column
- **std**, which is the standard deviation
- **min**, the minimum value
- **25%**, the 25th percentile of the values 
- **50%**, the 50th percentile of the values, which is equivalent to the median
- **75%**, the 75th percentile of the values
- **max**, the maximum value
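Here is a sketch of `.describe()` on one column, on two columns, and with transposed output. The values are invented so the statistics are easy to check by hand.

```python
import pandas as pd

# Hypothetical values; only the column names echo the lesson's dataset.
toy = pd.DataFrame({
    "crack-use": [0.0, 0.5, 1.0, 1.5],
    "alcohol-frequency": [3.0, 6.0, 6.0, 13.0],
})

# On a single column (a Series) the result is a Series of statistics.
single = toy["crack-use"].describe()
print(single)

# On several columns the result is a DataFrame, one column per input column.
both = toy[["crack-use", "alcohol-frequency"]].describe()
print(both)

# Transposing gives one row per original column, which is easier to read
# when there are many columns.
transposed = both.T
print(transposed)
```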

<img src="https://snag.gy/AH6E8I.jpg">

There are built-in math functions that will work on all of the columns of a DataFrame at once, or subsets of the data (series).

We can use the ```.mean()``` function on the ```drug``` DataFrame to get the mean for every column.

In [19]:
# aggregate mean on drug dataframe
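A sketch of `.mean()` on a whole DataFrame and on a single column, using made-up data. One caveat, stated as an assumption about your environment: recent Pandas versions may raise an error when the DataFrame contains text columns, so this example passes `numeric_only=True` to skip them.

```python
import pandas as pd

# Hypothetical data with one text column and two numeric columns.
toy = pd.DataFrame({
    "age": ["12", "13", "14"],       # text labels, not numbers
    "crack-use": [0.0, 0.1, 0.5],
    "n": [10, 20, 30],
})

# numeric_only=True leaves the text column out of the calculation.
column_means = toy.mean(numeric_only=True)  # a Series indexed by column name
print(column_means)

# On a single column the result is one number.
one_mean = toy["n"].mean()
print(one_mean)
```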

#### Are there any characteristics we may not see in our dataset with "describe" that you think could be important?

<a id='independent_practice'></a>

### Independent Practice

---

Now that we know a little bit about basic DataFrame use, let's practice on a new dataset.

> Pro tip:  You can use the "tab" key to browse filesystem resources when your cursor is in a string to get a relative reference to the files that can be loaded in Jupyter notebook.  Remember, you have to use your arrow keys to navigate the files populated in the UI. 

<img src="https://snag.gy/IlLNm9.jpg">

1. Find and load the "diamonds" dataset into a DataFrame (in the datasets directory).
1. Print out the columns.
1. What does the dataset look like in terms of dimensions?
1. Check the types of each column.
  1. What is the most common type?
  1. How many entries are there?
  1. How much memory does this dataset consume?
1. Examine the summary statistics of the dataset.

In [52]:
csv_file = "update this"
diamonds = pd.read_csv(csv_file)

### Finish the practice here.

### Conclusion

1. If we considered the cleanliness of a dataset, which aspects would you be most concerned with? More importantly, how would you inspect or investigate the dataset to determine how clean it is?
1. Which potential problems with data do you think could arise before predictive modeling / machine learning?
1. What can you do with a DataFrame that you can't do with a series?
