
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Pandas

[Pandas](https://pandas.pydata.org/pandas-docs/stable/reference/index.html) is a popular Python library among data scientists that provides high-performance, easy-to-use data structures and data analysis tools.

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this lesson you:<br><br>
* Explain what pandas is and why it's so popular
* Create and manipulate pandas DataFrames and Series
* Perform operations on pandas objects

First, let us import pandas with the alias `pd` so we can refer to the library without having to type `pandas` out each time. Pandas is pre-installed on Databricks.

In [0]:
import pandas as pd

pd.__version__

### Why use `pandas`?

Let's start big picture...<br><br>

* Humans are tool-using animals
* Computers are one of the most powerful tools we've created
* If you write code, you can unlock the full power of these tools

Ok, cool. But why `pandas`?<br><br>

* More and more, data is leading decision making
* Excel is great but what if...
  - You want to automate your analysis so it re-runs on new data each day?
  - You want to build a code base to share with your colleagues?
  - You want more robust analyses to feed a business decision?
  - You want to do machine learning?
* `pandas` is one of the core libraries used by data analysts and data scientists in Python

Enter `pandas`...

### Introduce `pandas` and its history

`pandas` is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

Highlights:

- Built in 2008, open sourced in 2009
- A fast and efficient **DataFrame object** for data manipulation with integrated indexing;
- Tools for **reading and writing data** between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;
- Intelligent **data alignment** and integrated handling of **missing data**: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form;
- Flexible **reshaping and pivoting** of data sets;
- Intelligent label-based **slicing, fancy indexing, and subsetting** of large data sets;
- Columns can be inserted and deleted from data structures for **size mutability**;
- Aggregating or transforming data with a powerful **group by** engine allowing split-apply-combine operations on data sets;
- High performance **merging and joining** of data sets;
- Hierarchical axis **indexing** provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure;
- **Time series** functionality: date range generation and frequency conversion, moving window statistics, date shifting and lagging. Even create domain-specific time offsets and join time series without losing data;
- Highly **optimized** for performance, with critical code paths written in Cython or C;
- Python with pandas is in use in a wide variety of **academic and commercial domains**, including Finance, Neuroscience, Economics, Statistics, Advertising, Web Analytics, and more.
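
To make two of the highlights above concrete, here is a minimal sketch of a group-by aggregation and of label-based data alignment. The `sales` table, its column names, and the two small Series are invented purely for illustration.

```python
import pandas as pd

# Hypothetical sales data, invented purely for illustration
sales = pd.DataFrame({
    "region": ["east", "west", "east", "west"],
    "amount": [10, 20, 30, 40],
})

# Group by: split the rows by region, sum each group, combine the results
totals = sales.groupby("region")["amount"].sum()

# Data alignment: values are matched by index label, not by position
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20], index=["y", "z"])

# "x" has no partner in b, so the result for "x" is NaN (missing)
aligned = a + b
```

Because alignment uses labels, the two Series do not need to have the same length or the same ordering.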


[Check out the book](https://www.amazon.com/gp/product/1491957662/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=quantpytho-20&creative=9325&linkCode=as2&creativeASIN=1491957662&linkId=ea8de4253cce96046e8ab0383ac71b33)

## Pandas DataFrames

Let's see how we can create a simple [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) in Pandas. We are going to create a list with 5 strings that we wish to store inside the DataFrame.

In [0]:
data = ["row one", "row two", "row three", "row four", "row five"]

pd.DataFrame(data=data)


The `0` in the very first row is the name of the column, which defaults to integers if not specified. What if we want to add another column and name the columns?

In [0]:
data = [["row one", 1], ["row two", 2], ["row three", 3], ["row four", 4], ["row five", 5]]
column_names = ["Strings", "Integers"]

df = pd.DataFrame(data=data, columns=column_names)
df
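
As an alternative sketch, the same kind of table can be built column by column from a dictionary instead of row by row from nested lists. The variable names `data_by_column` and `df_from_dict` are made up for this example.

```python
import pandas as pd

# Each dictionary key becomes a column name;
# each list holds the values of that column, top to bottom
data_by_column = {
    "Strings": ["row one", "row two", "row three", "row four", "row five"],
    "Integers": [1, 2, 3, 4, 5],
}

df_from_dict = pd.DataFrame(data=data_by_column)
df_from_dict
```

Which form is more convenient usually depends on whether your data arrives as records (rows) or as separate columns.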

Now we have multiple columns in our DataFrame! We can check the data type of each column in `df`.

In [0]:
df.dtypes
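
If you need to act on a column's type in code rather than just read it, one possible approach is the helper functions in `pandas.api.types`. The sketch below rebuilds a shortened copy of `df` so it runs on its own. The exact dtype name shown for text columns can differ between pandas versions, so the example checks categories of types instead of names.

```python
import pandas as pd

df = pd.DataFrame(
    data=[["row one", 1], ["row two", 2], ["row three", 3]],
    columns=["Strings", "Integers"],
)

# A DataFrame has `.dtypes` (one entry per column); a Series has `.dtype`
int_dtype = df["Integers"].dtype

# These helpers answer "is this column an integer / numeric column?"
is_int = pd.api.types.is_integer_dtype(df["Integers"])
strings_are_numeric = pd.api.types.is_numeric_dtype(df["Strings"])
```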

What if our DataFrame has many rows and we don't want to print out the entire DataFrame? We can use the `.head()` and `.tail()` methods to limit the number of rows we see.

In [0]:
# look at the first 2 rows of df 
df.head(2)

In [0]:
# look at the last 2 rows of df 
df.tail(2)
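
The effect is easier to see on a larger table. The sketch below uses a hypothetical 100-row DataFrame called `big_df`, built only for this illustration, and also shows `.shape` as a way to check the size without printing any rows.

```python
import pandas as pd

# A hypothetical 100-row DataFrame, built only for this illustration
big_df = pd.DataFrame({"n": range(100)})

# (number of rows, number of columns), without displaying any data
n_rows, n_cols = big_df.shape

first_rows = big_df.head()   # with no argument, head() returns the first 5 rows
last_rows = big_df.tail(3)   # the last 3 rows
```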

What if we had many columns and didn't want to see all of them? We can select specific columns by passing a list of column names inside brackets, which is called "indexing" our DataFrame. This returns another DataFrame, which we can display.

In [0]:
cols_to_show = ["Integers"]
df[cols_to_show]
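
The list can hold more than one name. As a small sketch (using a shortened copy of `df` so it runs on its own), the columns come back in the order they appear in the list, which also makes this a simple way to reorder columns.

```python
import pandas as pd

df = pd.DataFrame(
    data=[["row one", 1], ["row two", 2], ["row three", 3]],
    columns=["Strings", "Integers"],
)

# Columns are returned in the order given in the list
cols_to_show = ["Integers", "Strings"]
subset = df[cols_to_show]
subset
```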

## Pandas Series

If we only want to select one column, we can do either of the following.

In [0]:
df.Integers

In [0]:
df["Integers"]

The 2 cells above returned Pandas [Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html) objects instead of Pandas DataFrame objects. A Pandas Series is a single column of a Pandas DataFrame.

In [0]:
type(df), type(df["Integers"])
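
Two details are worth sketching here. First, a single column name in brackets gives a Series, while a one-element list gives a one-column DataFrame. Second, attribute access such as `df.Integers` only works when the column name does not clash with something a DataFrame already defines. The `clash` DataFrame and its `count` column below are hypothetical, chosen because DataFrames already have a `.count()` method.

```python
import pandas as pd

df = pd.DataFrame(
    data=[["row one", 1], ["row two", 2], ["row three", 3]],
    columns=["Strings", "Integers"],
)

as_series = df["Integers"]    # one column name: a Series
as_frame = df[["Integers"]]   # a list of names: a one-column DataFrame

# Hypothetical column named "count", which clashes with DataFrame.count()
clash = pd.DataFrame({"count": [1, 2, 3]})
attribute_result = clash.count     # the built-in method, not the column
bracket_result = clash["count"]    # the column, as a Series
```

For this reason, bracket access is the safer default when column names are not under your control.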

We can index into a Series to get the entry in a specific row. By default, rows are labeled with integers starting at 0.

In [0]:
df["Strings"][0]
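
The lookup above works because the default row labels happen to match the row positions. When they differ, it helps to be explicit: `.loc` selects by label and `.iloc` selects by position. The Series `s` and its labels 10, 20, 30 below are invented for illustration.

```python
import pandas as pd

# A Series whose row labels are 10, 20, 30 instead of the default 0, 1, 2
s = pd.Series(["row one", "row two", "row three"], index=[10, 20, 30])

by_label = s.loc[10]      # the row labeled 10
by_position = s.iloc[0]   # the first row, whatever its label is

# Plain s[0] would raise a KeyError here, because no row is labeled 0
```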

## Operations

A key benefit of a Pandas Series is that mathematical operations are easy to perform on it: they are applied to every element at once, with no explicit loop.

In [0]:
df["Integers"] + df["Integers"]

In [0]:
df["Integers"] * df["Integers"]
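
A few more operations in the same spirit, as a self-contained sketch: combining a Series with a single number, comparing every element against a threshold, and summarizing the whole Series with an aggregation. The variable names are made up for this example.

```python
import pandas as pd

integers = pd.Series([1, 2, 3, 4, 5], name="Integers")

doubled = integers + integers   # element-wise: 1+1, 2+2, ...
plus_ten = integers + 10        # the single value 10 is added to every element
is_large = integers > 3         # element-wise comparison gives True/False values

total = integers.sum()          # adds up all the values
average = integers.mean()       # arithmetic mean of the values
```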

We can also store the result of an operation as a new column in our DataFrame `df`.

In [0]:
df["New Column"] = df["Integers"] * 100
df
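
Assigning with brackets changes `df` in place. If you would rather leave the original untouched, one option is `.assign()`, which returns a new DataFrame with the extra column. The sketch below uses a shortened copy of `df` and a hypothetical column name, `Squared`.

```python
import pandas as pd

df = pd.DataFrame(
    data=[["row one", 1], ["row two", 2], ["row three", 3]],
    columns=["Strings", "Integers"],
)

# assign() returns a new DataFrame; df itself is not modified
df_with_squares = df.assign(Squared=df["Integers"] ** 2)

# drop(columns=...) likewise returns a new DataFrame without the named columns
df_without_strings = df_with_squares.drop(columns=["Strings"])
```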

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>