![Erudio logo](img/erudio-logo-small.png)
---
![Pandas logo](img/pandas-logo-small.png)

# What is Pandas?

There are a couple of ways of looking at the large and widely used Pandas library.

From an infrastructure and tool creation point of view, Pandas is a layer on top of NumPy that adds the crucial concept of "labeled data."  We have some of that with record arrays in NumPy, but Pandas goes much further in that direction.  As well as providing friendlier and more powerful ways of selecting data, Pandas adds a large number of additional functions and methods for various kinds of computations on data.
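
As a minimal sketch of what "labeled data" means in practice, the cell below wraps a small NumPy array in a Pandas Series.  The names and values (`heights`, `'ana'`, `'ben'`, `'cy'`) are made up purely for illustration and are not part of the course data.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements, invented for this illustration
heights = np.array([1.62, 1.75, 1.80])

# With a plain NumPy array, we select by integer position only
second = heights[1]

# A Series holds the same values but attaches a label to each one
labeled = pd.Series(heights, index=['ana', 'ben', 'cy'], name='height_m')

# Now we can select by label ...
ben_height = labeled['ben']

# ... and arithmetic aligns on labels rather than on position
other = pd.Series([0.10, 0.20], index=['cy', 'ana'])
adjusted = labeled + other   # 'ben' has no partner, so the result is NaN there
```

Notice that `other` lists its labels in a different order and omits one of them; Pandas matches values by label anyway, which is something a bare NumPy array cannot do for us.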

Recall the picture of the PyData ecosystem that we showed when introducing NumPy.

![NumPy ecosystem](img/numpy-ecosystem.png)

Another way of looking at Pandas is in terms of the workflows it typically enables.  Pandas comes with many functions to read different data sources, as well as hooks for visualizing data and presenting aggregate results.  Many data scientists, in particular, spend almost all of their working day reading, processing, and utilizing Pandas DataFrames.  For a certain kind of expert, nearly every task in their daily work is done within Pandas.

```{image} img/data.pandas.profit.png
:alt: Pandas data profit
:width: 75%
:align: center
```

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from src.training import *

## Working with Data, Pandas Style

The next cells will probably not all make sense at first glance. They use many concepts and APIs related to Pandas that you have not yet been taught.  However, they are useful as an immersion, just to see a representative sample of Pandas capabilities. Later modules will return to each of these areas in much more detail.

### Reading data

Here we will perform a number of operations on the same Wisconsin Breast Cancer dataset that was used in the exercises for the Advanced NumPy module.  If you did not go through that module, this is a widely used example dataset for machine learning and other purposes.  It contains observations of a number of biopsied tumors, some benign and some malignant, with numerous features measured on each one.

In [None]:
# Note: the "current DataFrame" is very often simply called `df`; here we name it `cancer`
cancer = pd.read_csv('data/wisconsin.csv')
cancer

### Summary statistics

We can get a quick picture of the data with some general DataFrame methods.

In [None]:
cancer.info()

In [None]:
cancer.describe()

In [None]:
cancer[['mean radius', 'mean texture', 'mean perimeter', 'mean area']].skew()

### Selections

One limitation of Pandas DataFrames versus NumPy arrays is that DataFrames are inherently 2-D.  A Pandas capability called "hierarchical indexing" is discussed in later lessons, and provides a way to simulate higher dimensionality.  But mostly, think of Pandas as giving you 2-D tables.

Within the two dimensions of a DataFrame, however, selecting particular data is generally more intuitive and obvious than the equivalent action on a NumPy array.

In [None]:
cancer.loc[cancer['benign?'] == 1, 'worst radius':'worst area']
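
For a first taste of the "hierarchical indexing" mentioned above, here is a minimal sketch on a tiny invented table.  The column names (`site`, `year`, `sensor`, `value`) and all the numbers are assumptions made up for this illustration; later lessons cover the real details.

```python
import pandas as pd

# Hypothetical data: one reading for each (site, year, sensor) combination
readings = pd.DataFrame({
    'site':   ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'year':   [2020, 2020, 2021, 2021, 2020, 2020, 2021, 2021],
    'sensor': ['x', 'y', 'x', 'y', 'x', 'y', 'x', 'y'],
    'value':  [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
})

# Moving three columns into the index gives a 3-level MultiIndex
cube = readings.set_index(['site', 'year', 'sensor'])['value']

# Select along the first level, much like arr[0] on a 3-D array
site_a = cube.loc['A']

# Select a single cell by giving one label per level
one_cell = cube.loc[('B', 2021, 'x')]

# Pivot one index level into columns to get an ordinary 2-D table
table = cube.unstack('sensor')
```

The data are still stored as a flat table, but the three index levels let us address them much as we would a 2×2×2 array.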

### Grouping data

Pandas gives us a SQL-like ability to group related data for aggregations.

In [None]:
cancer.groupby('benign?').mean().T  # Transpose is more readable

In [None]:
def spread(s):
    return s.max() - s.min()

cancer.groupby('benign?').agg(['mean', 'std', spread])
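
To make the SQL analogy a little more concrete, the sketch below groups a tiny invented table and notes a roughly equivalent query in a comment.  The `sales` table and its column names are hypothetical, and the SQL is only an approximate translation.

```python
import pandas as pd

# Hypothetical toy table, invented for this illustration
sales = pd.DataFrame({
    'region': ['east', 'west', 'east', 'west', 'east'],
    'amount': [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Roughly:
#   SELECT region,
#          AVG(amount) AS mean_amount,
#          MAX(amount) - MIN(amount) AS spread
#   FROM sales GROUP BY region
summary = sales.groupby('region')['amount'].agg(
    mean_amount='mean',
    spread=lambda s: s.max() - s.min(),
)
```

The keyword arguments name the output columns, which plays a similar role to `AS` in the query.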

### Plotting

Pandas DataFrames have a `.plot()` method (and a few others) that calls out to Matplotlib to create graphs of the data within the DataFrame.

In [None]:
cancer['mean radius'].plot(kind='hist', title="Histogram of Mean Radius", bins=15);

In [None]:
df = cancer.sort_values('mean radius')
df.reset_index()

In [None]:
# Using the Pandas "fluent style" to chain operations
(cancer
    .loc[cancer['benign?'] == 0]
    .sort_values('mean perimeter')
    .reset_index()
    [['mean area', 'worst area']]
    .plot(title='Area measures of malignant tumors sorted by perimeter')
);

# Exercises

Let us perform some more analysis of the same Wisconsin cancer dataset we have used for demonstration.

Create graphs similar to those shown above.  Rather than visually comparing mean and worst area, sorted by perimeter, make the same comparison for the radius measures.  Moreover, visualize the data separately for benign and malignant tumors, and characterize in your own words the differences in the patterns you see.

In [None]:
# Load the data
cancer = ...

# Visualize patterns
...

# Describe the pattern
pattern = """
It appears that ...
"""

---

Of those observations that have a larger than median value of "mean radius", what are the mean and standard deviation of their "concavity error"?

Answers: 
* mean=0.036131897
* stddev=0.02302538

In [None]:
# Mean/standard deviation of "concavity error" where "mean radius" is above its median
...

Among the benign tumors, what is the correlation coefficient between "mean symmetry" and "mean fractal dimension" (i.e., the Pearson product-moment correlation coefficient)?

Answer: 
* coefficient=0.41905971

In [None]:
# Correlation coefficient between mean symmetry and mean fractal dimension
...

Which feature in the data shows the largest variance?

Answer: worst area

In [None]:
# Feature with highest variance
...

Ignoring the target, which feature shows the highest normalized standard deviation? That is, the standard deviation as a percentage of the entire value range (maximum minus minimum) of that feature.

Answer: worst concave points

In [None]:
# Highest normalized standard deviation
...


---

Materials licensed under [CC BY-NC-ND 4.0](https://creativecommons.org/licenses/by-nc-nd/4.0/) by the authors