# Data Manipulation with Pandas

<div class="alert alert-block alert-danger">
<b>Check the kernel you are using:</b> Before we get started, if you are running this on HiPerGator, double-check the kernel in use. This is shown in the top right of the window and should look like: <img src="images/kernel.python310.png" alt="Image showing that the notebook is using the Python 3.10 kernel" style="float:right">
</div>

This notebook is based on [chapter 3 of Jake VanderPlas' Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/03.00-introduction-to-pandas.html). [<img src="images/PDSH-cover-small.png" alt="PDSH Cover Image" style="width: 50px;float:right"/>](https://jakevdp.github.io/PythonDataScienceHandbook/03.00-introduction-to-pandas.html)

As the chapter points out, Pandas builds on NumPy. We saw that NumPy has great features for fast, memory-efficient numerical computation. But real data is often organized in tables, and NumPy has no way to label columns or rows with meaningful names. Referring to the work of managing data, which often includes missing values and non-conforming data types, the text notes:
> Pandas, and in particular its `Series` and `DataFrame` objects, builds on the NumPy array structure and provides efficient access to these sorts of "data munging" tasks that occupy much of a data scientist's time.

Let's load NumPy and Pandas, using the standard aliases `np` and `pd`, and check which version of Pandas we have:

In [None]:
import numpy as np
import pandas as pd
pd.__version__

The text mentions the three main data types that Pandas provides: `Series`, `DataFrame` and `Index`. We'll start with `Series` as it helps connect NumPy arrays with Pandas objects.

## Pandas `Series`

> As we see in the output, the `Series` wraps both a sequence of values and a sequence of indices, which we can access with the `values` and `index` attributes. The `values` are simply a familiar NumPy array:

> The essential difference \[between a NumPy array and a Pandas `Series`\] is the presence of the index: while the Numpy Array has an *implicitly* defined integer index used to access the values, the Pandas `Series` has an *explicitly* defined index associated with the values.
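
The two attributes the quotes describe can be seen with a small example. This is a minimal sketch that builds a `Series` with the default integer index; the values are arbitrary illustration data.

```python
import numpy as np
import pandas as pd

# A Series built from a plain list gets a default integer index
data = pd.Series([0.25, 0.5, 0.75, 1.0])

print(data.values)        # the underlying NumPy array
print(type(data.values))  # <class 'numpy.ndarray'>
print(data.index)         # the index object holding the labels 0..3
print(data[1])            # 0.5, looked up by the label 1
```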

In [None]:
# Indices do not need to be integers:

data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

As the text points out, this is similar to a Python dictionary. But with a Pandas `Series`, both the index and the values are typed. As with NumPy arrays, the explicit typing in a `Series` adds efficiency in memory and computation.
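
As a rough illustration of the dictionary analogy, the sketch below looks up a value by label and tests membership, then shows that all values share one dtype. It reuses the `data` values from the cell above.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])

print(data['b'])    # 0.5, dictionary-style lookup by label
print('a' in data)  # True, membership is tested against the index
print(data.dtype)   # float64, one type shared by all values
```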

## Pandas `DataFrame`

The main data structure that makes Pandas popular is the `DataFrame` class.

> If a `Series` is an analog of a one-dimensional array with flexible indices, a `DataFrame` is an analog of a two-dimensional array with both flexible row indices and flexible column names.

Here are the two `Series` that the text combines into a `DataFrame`:

In [None]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

In [None]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area

In [None]:
# In addition to the index, DataFrames have columns
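
One way to fill in this cell, sketched under the assumption that the two `Series` are combined through a dictionary of column names. The variable name `states` and the column labels are choices made for this example.

```python
import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127, 'Florida': 19552860,
                        'Illinois': 12882135})
area = pd.Series({'California': 423967, 'Texas': 695662, 'New York': 141297,
                  'Florida': 170312, 'Illinois': 149995})

# Each dictionary key becomes a column; the shared index becomes the row labels
states = pd.DataFrame({'population': population, 'area': area})

print(states)
print(states.index)    # row labels
print(states.columns)  # column labels
```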



## Accessing data in `DataFrame`s

In [None]:
# Careful: attribute-style access (df.column) can break
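
A sketch of how attribute access can go wrong. The column name `pop` is chosen for this example because it collides with the existing `DataFrame.pop` method; the data is trimmed to two states to keep it short.

```python
import pandas as pd

states = pd.DataFrame({'area': pd.Series({'California': 423967, 'Texas': 695662}),
                       'pop': pd.Series({'California': 38332521, 'Texas': 26448193})})

# Attribute access works when the column name does not clash with anything
area_matches = states.area.equals(states['area'])
print(area_matches)          # True

# 'pop' is also a DataFrame method, so the attribute returns the method
pop_is_method = callable(states.pop)
print(pop_is_method)         # True

# Dictionary-style access always returns the column
print(states['pop'])
```

Dictionary-style indexing is the safer habit, especially when assigning to a column.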

In [None]:
# Add a new column
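
A possible way to add a column, assuming a population density computed from the two existing columns. The column name `density` is illustrative.

```python
import pandas as pd

states = pd.DataFrame({
    'population': pd.Series({'California': 38332521, 'Texas': 26448193,
                             'New York': 19651127}),
    'area': pd.Series({'California': 423967, 'Texas': 695662,
                       'New York': 141297}),
})

# Assigning to a new key creates the column; the division is element-wise
states['density'] = states['population'] / states['area']
print(states)
```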



## Index alignment

I'm skipping a bunch of stuff in the text, but want to show this important and useful feature of `Series`.

> For binary operations on two `Series` or `DataFrame` objects, Pandas will align indices in the process of performing the operation. This is very convenient when working with incomplete data, as we'll see in some of the examples that follow.

In [None]:
area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
                  'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127}, name='population')

Notice how Pandas seamlessly manages the combination of these partially overlapping `Series` indices.
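
To see the alignment in action, the sketch below divides the two `Series` defined above. The result name `density` is a choice made for this example.

```python
import pandas as pd

area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
                  'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127}, name='population')

# The result uses the union of both indices; labels present in only
# one Series (Alaska, New York) come out as NaN
density = population / area
print(density)
```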

## Handling Missing Data

> The difference between data found in many tutorials and data in the real world is that real-world data is rarely clean and homogeneous. In particular, many interesting datasets will have some amount of data missing. To make matters even more complicated, different data sources may indicate missing data in different ways.

> There are a number of schemes that have been developed to indicate the presence of missing data in a table or DataFrame. Generally, they revolve around one of two strategies: using a **mask** that globally indicates missing values, or choosing a **sentinel** value that indicates a missing entry.

### `None`: Pythonic missing data

> The first sentinel value used by Pandas is `None`, a Python singleton object that is often used for missing data in Python code. Because it is a Python object, `None` cannot be used in any arbitrary NumPy/Pandas array, but only in arrays with data type `'object'` (i.e., arrays of Python objects):
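
A short sketch of what the quote describes: an array containing `None` falls back to the `object` dtype, and aggregations over it fail. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

vals1 = np.array([1, None, 3, 4])
print(vals1.dtype)  # object

# Adding an int to None is undefined, so the aggregation raises
sum_failed = False
try:
    vals1.sum()
except TypeError as err:
    sum_failed = True
    print("TypeError:", err)

# In a numeric Series, Pandas converts None to NaN instead
converted = pd.Series([1, None, 3, 4])
print(converted.dtype)  # float64
```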

### `NaN`: Missing numerical data

> The other missing data representation, `NaN` (acronym for Not a Number), is different; it is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation:

But be a bit careful... `NaN` "is a bit like a data virus–it infects any other object it touches."
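
A minimal sketch of both points: an array containing `NaN` keeps a native floating-point dtype, and any arithmetic that touches `NaN` yields `NaN`.

```python
import numpy as np

vals2 = np.array([1, np.nan, 3, 4])
print(vals2.dtype)  # float64

# Arithmetic with NaN always produces NaN
print(1 + np.nan)   # nan
print(0 * np.nan)   # nan

# Ordinary aggregations are infected too
print(vals2.sum())  # nan
```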

In [None]:
# There are some NaN aware functions
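
A few of NumPy's NaN-aware aggregations, sketched on the same kind of array. The equivalent Pandas `Series` methods skip `NaN` by default.

```python
import numpy as np
import pandas as pd

vals2 = np.array([1, np.nan, 3, 4])

# The nan* variants ignore missing values
print(np.nansum(vals2))  # 8.0
print(np.nanmin(vals2))  # 1.0
print(np.nanmax(vals2))  # 4.0

# Series aggregations skip NaN unless skipna=False is passed
print(pd.Series(vals2).sum())  # 8.0
```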



> Keep in mind that `NaN` is specifically a floating-point value; there is no equivalent `NaN` value for integers, strings, or other types.
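
One visible consequence, sketched below: a `Series` of integers that contains a missing value is stored as floating point. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

ints = pd.Series([1, 2, 3])
with_nan = pd.Series([1, np.nan, 3])

print(ints.dtype)      # an integer dtype
print(with_nan.dtype)  # float64, because NaN only exists as a float
```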

## Dealing with null values

In [None]:
data = pd.Series([1, np.nan, 'hello', None])

In [None]:
# Detecting null values


In [None]:
# Use a boolean mask to get non-null values


In [None]:
# Or more simply dropna
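
The three cells above could be filled in along these lines. This is a sketch using the `data` Series defined at the start of this section.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 'hello', None])

# Detecting null values: both NaN and None count as null
print(data.isnull())

# Use a boolean mask to keep only the non-null values
print(data[data.notnull()])

# dropna() gives the same result more directly
print(data.dropna())
```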


In [None]:
df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
df

In [None]:
# dropna by default drops rows with NaN


In [None]:
# Or use axis=1 (or axis='columns') to drop columns instead
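
A sketch of both directions on the `df` defined above. `dropna()` never removes single cells; it removes whole rows or whole columns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])

# Default: drop every row that contains at least one NaN
rows_kept = df.dropna()
print(rows_kept)

# axis='columns' (or axis=1): drop every column that contains a NaN
cols_kept = df.dropna(axis='columns')
print(cols_kept)
```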


In [None]:
# There is a `how` parameter, but `thresh` is probably more useful
# thresh: minimum number of non-NaN values a row or column needs to be kept
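
A sketch of `thresh` at work. The copy named `wide`, with an extra all-NaN column, is an assumption added here so that the difference from the default behavior is visible.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])

# Hypothetical variant of df with a column made entirely of NaN
wide = df.copy()
wide[3] = np.nan

# Every row now contains a NaN, so the default drops everything
print(len(wide.dropna()))  # 0

# how='all' only drops columns whose values are all NaN
print(wide.dropna(axis='columns', how='all'))

# thresh=3 keeps rows with at least three non-NaN values
print(wide.dropna(thresh=3))
```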



## Filling Null Values

> Sometimes rather than dropping `NA` values, you'd rather replace them with a valid value. This value might be a single number like zero, or it might be some sort of imputation or interpolation from the good values. You could do this in-place using the `isnull()` method as a mask, but because it is such a common operation Pandas provides the `fillna()` method, which returns a copy of the array with the null values replaced.

In [None]:
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
data

In [None]:
# forward-fill


In [None]:
# back-fill
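
A sketch of the three common fills on the `data` Series above. It uses the `ffill()` and `bfill()` methods rather than `fillna(method=...)`, which newer Pandas releases deprecate.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))

# Replace every null with a single value
zero_filled = data.fillna(0)
print(zero_filled)

# Forward-fill: propagate the previous valid value forward
forward = data.ffill()
print(forward)

# Back-fill: propagate the next valid value backward
backward = data.bfill()
print(backward)
```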


In [None]:
# Similar in DataFrames

df[3] = np.nan # I skipped a step earlier where the text added a column of NaNs
df
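
For a `DataFrame` the same methods apply, with an `axis` argument to choose the fill direction. This is a minimal sketch that rebuilds `df` so it runs on its own.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
df[3] = np.nan

# Forward-fill along each row (left to right)
filled = df.ffill(axis=1)
print(filled)
```

If no earlier value is available in the fill direction, as in the first column of the last row, the `NaN` remains.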