<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">


# Introduction to Pandas



Pandas is the most popular Python package for managing data sets. It's used extensively by data scientists.

### Learning Objectives

- Define the anatomy of DataFrames.
- Explore data with DataFrames.

### Lesson Guide

- [Introduction to `pandas`](#introduction)
- [Loading CSV Files](#loading_csvs)
- [Exploring Your Data](#exploring_data)
- [Data Dimensions](#data_dimensions)
- [DataFrames vs. Series](#dataframe_series)
- [Using the `.info()` Function](#info)
- [Using the `.describe()` Function](#describe)
- [Independent Practice](#independent_practice)
- [Pandas Indexing](#indexing)
- [Creating DataFrames](#creating_dataframes)
- [Checking Data Types](#dtypes)
- [Renaming and Assignment](#renaming_assignment)
- [Logical Filtering](#filtering)
- [Sorting](#sorting)
- [Review](#review)

<a id='introduction'></a>

### What is a dataframe?

---
The concept of a "dataframe" comes from the world of statistical software used in empirical research:
- It generally refers to "tabular" data: a data structure representing cases (rows), each of which consists of a number of observations or measurements (columns).
- **Each row** is treated as a **single observation** of **multiple "variables"**.
- The row ("record") datatype can be **heterogeneous** (a tuple of different types).
- The column datatype must be **homogeneous**.
- Data frames usually contain some **metadata** in addition to **data**; for example, column and row names (unlike NumPy arrays by default).
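The properties listed above can be seen in a small sketch. The `people` table and its values are made up for illustration; they are not part of the lesson's data sets.

```python
import pandas as pd

# Hypothetical toy data: three cases (rows) and three variables (columns).
people = pd.DataFrame(
    {
        "name": ["Ada", "Grace", "Linus"],
        "age": [36, 45, 28],
        "height_m": [1.65, 1.70, 1.80],
    },
    index=["r1", "r2", "r3"],
)

# Each column holds a single data type (homogeneous).
print(people.dtypes)

# A single row mixes text, an integer, and a float (heterogeneous).
print(people.loc["r1"].tolist())

# Metadata: the row labels and column names travel with the data.
print(list(people.index))
print(list(people.columns))
```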


### What is `pandas`?

---

- A data analysis library (**P**anel **D**ata **S**ystem).
- It was created by Wes McKinney and open sourced by AQR Capital Management, LLC in 2009.
- It's implemented in highly optimized Python/Cython.
- It's the **most ubiquitous tool** used to start data analysis projects within the Python scientific ecosystem.


### Pandas Use Cases

---

- Cleaning data/munging.
- Exploratory analysis.
- Structuring data for plots or tabular display.
- Joining disparate sources.
- Modeling.
- Filtering, extracting, or transforming. 


![](https://snag.gy/tpiLCH.jpg)

![](https://snag.gy/1V0Ol4.jpg)

### Common Outputs

---

With `pandas` you can:

- Export to databases
- Integrate with `matplotlib`
- Collaborate in common formats (plus a variety of others)
- Integrate with Python built-ins (**and `numpy`!**)


### Importing `pandas`

---

Import `pandas` at the top of your notebook like so:

In [None]:
import pandas as pd
import numpy as np

Recall that the **`import pandas as pd`** syntax nicknames the `pandas` module as **`pd`** for convenience.

<a id='loading_csvs'></a>

### Loading a CSV into a DataFrame

---

`pandas` can load many types of files, but one of the most commonly used for storing data is a ```.csv```. As an example, let's load a data set on diamond characteristics from the ```./datasets``` directory:

In [None]:
diamonds = pd.read_csv('./datasets/diamonds.csv')

This creates a `pandas` object called a **DataFrame**. DataFrames are powerful containers, featuring many built-in functions for exploring and manipulating data.

We will barely scratch the surface of DataFrame functionality in this lesson, but, throughout this course, you will become an expert at using them.

In short, a DataFrame is a supercharged 2D array:

- It has the data.
- It has information about the data (metadata such as column names).
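If you don't have the file handy, the same loading step can be sketched with CSV text held in memory. The three rows in `csv_text` are invented for illustration; `pd.read_csv` accepts a file-like object as well as a path.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
csv_text = "carat,cut,price\n0.23,Ideal,326\n0.21,Premium,326\n0.29,Good,334\n"

# read_csv accepts a path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.to_numpy())         # the data as a plain 2D array
print(sample.columns.tolist())   # metadata: column names
print(sample.index)              # metadata: row labels (a RangeIndex by default)
```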

<a id='exploring_data'></a>

### Exploring Data using DataFrames

---

DataFrames come with built-in functionality that makes data exploration easy. 

To start, let's look at the **"header"** of your data using the ```.head()``` function. If run alone in a notebook cell, it will show you the first five rows of the data set (very wide tables may have some columns elided in the display).

In [None]:
diamonds.head()

If we want to see the last part of our data, we can use the ```.tail()``` function equivalently.

In [None]:
diamonds.tail()
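Both functions take an optional number of rows. Here is a minimal sketch on a made-up ten-row table (the `numbers` DataFrame is an assumption for illustration):

```python
import pandas as pd

# Hypothetical table with ten rows, numbered 0 through 9.
numbers = pd.DataFrame({"n": range(10)})

print(numbers.head(3))   # first three rows
print(numbers.tail(2))   # last two rows
print(len(numbers.head()))  # with no argument, five rows are returned
```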

<a id='data_dimensions'></a>

### Data Dimensions

---

It's always good to look at the dimensions of your data. The ```.shape``` property will tell you how many rows and columns are contained within your DataFrame.

In [None]:
diamonds.shape

As you can see, we have 53940 rows and 10 columns.

You'll also notice that this property works the same way as `.shape` for `numpy` arrays/matrices. **`pandas` makes use of `numpy` under the hood** for optimization and speed.
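A quick way to see the parallel with `numpy` is to convert a DataFrame to an array and compare shapes. The `table` below is a made-up example.

```python
import numpy as np
import pandas as pd

# Hypothetical table: four rows, two columns.
table = pd.DataFrame({"x": [1, 2, 3, 4], "y": [10.0, 20.0, 30.0, 40.0]})

# Convert the values to a plain numpy array.
array = table.to_numpy()

print(table.shape)   # (rows, columns)
print(array.shape)   # the same tuple
```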

You can look up the names of your columns using the ```.columns``` property.


In [None]:
diamonds.columns

Accessing a specific column is easy. You can use **bracket** syntax just like you would with **Python dictionaries**, using the column's string name to extract it.

In [None]:
diamonds['carat'].head()

As you can see, we can also use the ```.head()``` function on a single column, which is represented as a `pandas` Series object.

With a **list of strings**, you can also access a column (as a DataFrame instead of a Series).

In [None]:
diamonds[['carat']].head()

In [None]:
diamonds[['cut','carat']].head()

<a id='dataframe_series'></a>

### DataFrame vs. Series

---

There is an important difference between using a list of strings versus only using a string with a column's name: When you use a list containing the string, it returns another **DataFrame**. But, when you only use the string, it returns a `pandas` **Series** object.

In [None]:
print(type(diamonds['cut']))

In [None]:
print(type(diamonds[['cut']]))

**Breakout (2min):** What's the difference between `pandas` Series and DataFrame objects?

As long as your column names **don't contain any spaces** or other special characters (underscores are OK), you can also access a column as an attribute of a DataFrame (for example, `diamonds.cut`).

**Get in the habit of referencing your Series columns using `df['my_column']` rather than with object notation (`df.my_column`)**. There are many edge cases in which the object notation does not work, along with nuances as to how `pandas` will behave.

In [None]:
diamonds['cut'].head()

Remember: This will be a **Series** object, not a **DataFrame**.
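Two of the edge cases mentioned above can be sketched as follows. The column names `count` and `unit price` are hypothetical, chosen only because they break object notation.

```python
import pandas as pd

# Hypothetical column names chosen to trigger the edge cases.
df = pd.DataFrame({"count": [3, 5], "unit price": [1.5, 2.0]})

# "count" collides with the DataFrame.count() method, so object notation
# returns the method rather than the column.
print(callable(df.count))      # True
print(df["count"].tolist())    # [3, 5]

# A name containing a space cannot be written in object notation at all,
# but bracket syntax still works.
print(df["unit price"].tolist())
```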

<a id='info'></a>

### Examining Your Data With `.info()`

---

When getting acquainted with a new data set, `.info()` should be **the first thing** you examine.

**Types** are very important. They affect the way data will be **represented** in our machine learning models, how data can be joined, whether or not math operators can be applied, and instances in which you can encounter unexpected results.

> _Typical problems that arise when working with new data sets include_:
> - Missing values.
> - Unexpected types (string/object instead of int/float).
> - Dirty data (commas, dollar signs, unexpected characters, etc.).
> - Blank values that are actually "non-null" or single white-space characters.

`.info()` is a function available on every **DataFrame** object. It provides information about:

- The name of the column/variable attribute.
- The type of index (RangeIndex is default).
- The count of non-null values by column/attribute.
- The type of data contained in the column/attribute.
- The unique counts of **dtypes** (`pandas` data types).
- The memory usage of our data set.

#### For example: 

In [None]:
diamonds.info()
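To see how `.info()` surfaces the typical problems listed above, here is a sketch on a deliberately messy table. The `messy` DataFrame and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical messy data: dollar signs and commas, a whitespace-only cell,
# and one truly missing value.
messy = pd.DataFrame(
    {
        "price": ["$1,200", "950", "$2,050"],
        "notes": ["ok", " ", None],
        "qty": [1, 2, 3],
    }
)

# "price" is reported as a text column rather than a number,
# and "notes" shows fewer non-null values than there are rows.
messy.info()

# The single space counts as non-null, so only one value is missing.
print(messy["notes"].notna().sum())
print(pd.api.types.is_numeric_dtype(messy["price"]))
```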

<a id='describe'></a>

### Summarizing Data with `.describe()`

---

The ```.describe()``` function is useful for taking a quick look at your data. It returns some basic descriptive statistics.

For our example, use the ```.describe()``` function on only the ```carat``` column.

In [None]:
diamonds['carat'].describe()

You can also use it on multiple columns, such as ```carat``` and ```price```. They will need to be numeric types. 

In [None]:
diamonds[['carat','price']].describe()

```.describe()``` gives us the following statistics:

- **Count**, which is the number of non-null values in the column.
- **Mean**, or, the average of the values in the column.
- **Std**, which is the standard deviation.
- **Min**, a.k.a., the minimum value.
- **25%**, or, the 25th percentile of the values.
- **50%**, or, the 50th percentile of the values (which is equivalent to the median).
- **75%**, or, the 75th percentile of the values.
- **Max**, which is the maximum value.
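The statistics above can be checked by hand on a tiny Series. The values below are made up, with one missing entry included to show what the count reports.

```python
import numpy as np
import pandas as pd

# Hypothetical values with one missing entry.
values = pd.Series([1.0, 2.0, 3.0, 4.0, np.nan])

summary = values.describe()
print(summary)

# The count skips the missing value, so it is smaller than the length.
print(summary["count"], len(values))

# The 50% row matches the median.
print(summary["50%"] == values.median())
```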

<img src="https://snag.gy/AH6E8I.jpg">

#### Summary Functions

There are also built-in math functions that will work on a column of a DataFrame. 

For example, I can use the ```.mean()``` function on the `carat` column to get the mean carat weight for all the diamonds.

In [None]:
diamonds[['carat']].mean()

We can also use them on multiple columns:

In [None]:
diamonds[['carat','price']].max()

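If every column is numeric, you can call these functions directly on the DataFrame. `diamonds` also contains text columns such as `cut`, so pass `numeric_only=True` to restrict the calculation to the numeric columns (`pandas` 2.0 and later raise a `TypeError` otherwise, where older versions issued a warning):

In [None]:
diamonds.mean(numeric_only=True)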

<a id='independent_practice'></a>

### Independent Practice

---

Now that we know a little bit about basic DataFrame use, let's practice on a new data set.

> Pro tip: When your cursor is in a string, you can use the "tab" key to browse file system resources and get a relative reference for the files that can be loaded in Jupyter notebook. Remember, you have to use your arrow keys to navigate the files populated in the UI. 

<img src="https://snag.gy/IlLNm9.jpg">

1. Find and load the `cars` data set into a DataFrame (in the `datasets` directory).
2. Print out the columns.
3. What does the data set look like in terms of dimensions?
4. Check the types of each column.
  a. What is the most common type?
  b. How many entries are there?
  c. How much memory does this data set consume?
5. Examine the summary statistics of the data set.

In [None]:
csv_file = "datasets/cars.csv"
cars = pd.read_csv(csv_file)

In [None]:
# column names


In [None]:
# shape

In [None]:
# column data types, number of entries, and memory used. Seems like a lot of info


In [None]:
# summary stats 


<a id='indexing'></a>

### `pandas` Indexing 

---

More often than not, we want to operate on or extract specific portions of our data. When we perform indexing on a DataFrame or Series, we can specify a certain section of the data.

`pandas` has two main properties you can use for indexing:

- **`.loc`** indexes with the _labels_ for rows and columns.
- **`.iloc`** indexes with the _integer positions_ for rows and columns.

To help clarify these differences, let's first reset the row labels to letters using the ```.index``` attribute:

In [None]:
new_index_values = ['A','B','C','D','E','F','G','H','I','J','K','L','M',
                    'N','O','P','Q','R','S','T','U','V','W','X','Y','Z']
cars.index=new_index_values


In [None]:
cars.head()

Using the **`.loc`** indexer, we can pull out rows **'B' to 'F'** and the **`hp` and `wt`** columns.

In [None]:
subset = cars.loc[['B','C','D','E','F'], ['hp','wt']]

In [None]:
subset

We can do the same thing with the **`.iloc`** indexer, but we have to use integers for the location.

In [None]:
subset = cars.iloc[[1,2,3,4,5], [3,5]]

In [None]:
subset

If you try to index the rows or columns with integers using **`.loc`**, you will get an error, because **`.loc`** looks up labels and none of the current row or column labels are integers.

Note that you can automatically reorder the data just by reordering the indices you enter when you perform the indexing operation!
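The difference between the two indexers, and the reordering trick, can be sketched on a small stand-in table. The `toy` DataFrame and its values are assumptions for illustration, not the real `cars` data.

```python
import pandas as pd

# Hypothetical stand-in for the cars data, with letter row labels.
toy = pd.DataFrame(
    {"hp": [110, 93, 175], "wt": [2.62, 2.32, 3.44], "cyl": [6, 4, 8]},
    index=["A", "B", "C"],
)

# Select by label, listing rows and columns in a new order.
by_label = toy.loc[["C", "A"], ["wt", "hp"]]

# The same selection by integer position.
by_position = toy.iloc[[2, 0], [1, 0]]

print(by_label)
print(by_label.equals(by_position))  # True

# Label slices include both endpoints; position slices exclude the stop.
print(len(toy.loc["A":"B"]))  # 2
print(len(toy.iloc[0:1]))     # 1
```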

While we created an index earlier, we can also use a column to set an index.

In [None]:
cars.index = cars['mpg']

cars.head()

Is mpg the best feature to use as an index?  

If it isn't, we can use `df.reset_index()` to reset our index.

In [None]:
cars.reset_index(drop=True, inplace=True)
cars.head()

<a id='creating_dataframes'></a>

### Creating DataFrames

---

The simplest way to create your own DataFrame without importing data from a file is to give the ```pd.DataFrame()``` constructor a dictionary.

In [None]:
mydata = pd.DataFrame({'Letters':['A','B','C'], 'Integers':[1,2,3], 'Floats':[2.2, 3.3, 4.4]})

In [None]:
mydata

As you might expect, the dictionary needs to have lists of values that are all the same length. The keys correspond to the names of the columns, and the values correspond to the data in the columns.
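To see what happens when the lists are not the same length, here is a small sketch. The data is hypothetical, and the error is caught so the cell still runs.

```python
import pandas as pd

# Lists of equal length work: keys become column names.
good = pd.DataFrame({"Letters": ["A", "B", "C"], "Integers": [1, 2, 3]})
print(good.columns.tolist())

# Lists of unequal length are rejected with a ValueError.
error = None
try:
    pd.DataFrame({"Letters": ["A", "B", "C"], "Integers": [1, 2]})
except ValueError as exc:
    error = exc

print(type(error).__name__, error)
```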

<a id='dtypes'></a>

### Examining Data Types

---

`pandas` comes with a useful property for looking solely at the data types of your DataFrame columns. Use ```.dtypes``` on your DataFrame:

In [None]:
mydata.dtypes

This will show you the data type of each column. Strings are stored as a type called "object," as they are not guaranteed to take up a set amount of space (strings can be any length).

<a id='renaming_assignment'></a>

### Renaming and Assignment

---

`pandas` makes it easy to change column names and assign values to your DataFrame.

Say, for example, we want to change the column name `Integers` to `int`:

In [None]:
mydata.rename(columns={mydata.columns[1]:'int'}, inplace=True) # inplace = True updates mydata
print(mydata.columns)

In [None]:
mydata

If you want to change every column name, you can just assign a new list to the ```.columns``` property.

In [None]:
mydata.columns = ['A','B','C']
print(mydata.head())
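Two related points are worth a sketch: without `inplace=True`, `.rename()` returns a new DataFrame and leaves the original untouched, and assigning to a new column name with bracket syntax creates that column. The `frame` table and the `doubled` column are made up for illustration.

```python
import pandas as pd

# Hypothetical table for illustration.
frame = pd.DataFrame({"Letters": ["A", "B"], "Integers": [1, 2]})

# Without inplace=True, rename returns a renamed copy.
renamed = frame.rename(columns={"Integers": "int"})
print(frame.columns.tolist())    # original names are unchanged
print(renamed.columns.tolist())

# Assignment: bracket syntax with a new name adds a column.
frame["doubled"] = frame["Integers"] * 2
print(frame)
```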

<a id='filtering'></a>

### Filtering Logic

---

One of the most powerful features of DataFrames is the ability to use logical commands to filter data.

Subset the ```diamonds``` data for only the rows in which `price` is greater than 7000.

In [None]:
diamonds[diamonds['price'] > 7000]

#### Filtering on Multiple Conditions

We can also filter on _multiple conditions_. 
The format for multiple conditions is:

`df[ (df['col1'] == value1) & (df['col2'] == value2) ]`

Or, more simply:

`df[ (CONDITION 1) & (CONDITION 2) ]`

Which eventually may evaluate to something like:

`df[ True & False ]`

...on a row-by-row basis. If the end result is `False`, the row is omitted.

_Don't forget parentheses in your conditions!_ This is a common mistake.
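Here is a sketch of the row-by-row evaluation on a tiny made-up table (`gems` and its values are assumptions for illustration). The parentheses matter because `&` binds more tightly than `>` and `==` in Python.

```python
import pandas as pd

# Hypothetical three-row table.
gems = pd.DataFrame(
    {"price": [8000, 500, 9000], "cut": ["Ideal", "Ideal", "Good"]}
)

# Each condition produces one True/False value per row.
is_expensive = gems["price"] > 7000
is_ideal = gems["cut"] == "Ideal"

# Parentheses make sure each comparison is evaluated before the & combines them.
mask = (gems["price"] > 7000) & (gems["cut"] == "Ideal")

print(is_expensive.tolist())  # [True, False, True]
print(is_ideal.tolist())      # [True, True, False]
print(mask.tolist())          # [True, False, False]

# Only rows where the mask is True are kept.
print(gems[mask])
```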

#### Example 
Subset the data for `price` greater than 7000 like before, but now, also include where the `cut` is 'Ideal'.

In [None]:
diamonds[(diamonds['price'] > 7000) & (diamonds['cut']== 'Ideal')]

In [None]:
# What about diamonds where the cut is Premium OR the carat weight is greater than 0.50?
# "Or" logic - use pipe (|)

In [None]:
diamonds[(diamonds['cut']=='Premium')|(diamonds['carat']>0.5)]

#### Calculations on filtered data

Let's calculate the **mean carat weight** for diamonds with `cut` **Premium**.
> Think: What are the component parts of this problem?

In [None]:
# First find the diamonds where the cut == 'Premium'
# Then select the 'carat' column as a series
# Finally, find the mean
mean_carat = diamonds[diamonds['cut']=='Premium']['carat'].mean()
print(f'{mean_carat:.2f}')
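The same component parts can also be written with `.loc`, which filters the rows and selects the column in one step. This is an alternative sketch on made-up data (`gems` is hypothetical), not a replacement for the approach above.

```python
import pandas as pd

# Hypothetical four-row table.
gems = pd.DataFrame(
    {
        "cut": ["Premium", "Ideal", "Premium", "Good"],
        "carat": [1.0, 0.5, 2.0, 0.7],
    }
)

# Step 1: a True/False mask for the rows we want.
is_premium = gems["cut"] == "Premium"

# Step 2: filter the rows and pick the column in a single .loc call.
premium_carats = gems.loc[is_premium, "carat"]

# Step 3: summarize.
mean_carat = premium_carats.mean()
print(f"{mean_carat:.2f}")  # 1.50
```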

<a id='sorting'></a>

### Sorting

We can sort an individual Series, or sort the entire DataFrame by specifying which column to sort by.

In [None]:
# We can sort individual Series...
cars['mpg'].sort_values().head()

In [None]:
# sort in descending order
cars['mpg'].sort_values(ascending=False).head()

In [None]:
# Sort the entire DataFrame by the specific column
cars.sort_values('mpg').head()

In [None]:
# Or the entire DataFrame by more than one column using lists
# cars.sort_values(by=??? , ascending=??? ).head()

## Independent Practice

With our cars dataset already loaded, let's explore our dataset a bit more thoroughly to gain some familiarity with beginning exploratory analysis.

### 1. Select only data for "cars" when "hp" is "150-200".

### 2. Select only rows with index 5-10, for variables / columns "disp" and "hp"

### 3. Select the columns by numeric offset 2-5, rows with numeric index 3-7

### 4. Find the mean of `hp`


### 5. Find the mean of `hp` for cars with more than 4 cylinders

### 6. Sort the `cars` DataFrame by `cyl` in ascending order, then `mpg` in descending order

<a id='review'></a>

### Review 

We covered a lot of ground! It's ok if this takes a while to gel.

```python

# basic DataFrame operations
df.head()
df.tail()
df.shape
df.columns
df.index
df.info()

# selecting columns
df.column_name
df['column_name']
df[['column_name1','column_name2','column_name3']]

# notable columns operations
df.describe() # summary statistics (count, mean, std, min, quartiles, max)
df['col1'].nunique() # number of unique values
df['col1'].value_counts() # number of occurrences of each value in column

# filtering
df[ df['col1'] < 50 ] # filter column to be less than 50
df[ (df['col1'] == value1) & (df['col2'] > value2) ] # keep rows where col1 is equal to value1 AND col2 is greater than value2

# sorting
df.sort_values(by='column_name', ascending = False) # sort biggest to smallest

```


It's common to refer back to your own code *all the time.* Don't hesitate to reference this guide! 🐼
