# Pandas Illustrated: The Definitive Visual Guide to Pandas

https://betterprogramming.pub/pandas-illustrated-the-definitive-visual-guide-to-pandas-c31fa921a43

[Pandas illustrated in 35,000 characters (Chinese translation)](https://blog.csdn.net/cainiao_python/article/details/130143504)

[10 minutes to pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html)

![img](pandas.assets/1MDyxk2ivjo9sD2_kd_B1TQ.png)

[Pandas](https://pandas.pydata.org/) is an industry standard for analyzing data in Python. 

1. Motivation
2. Series and Index
3. DataFrames
4. MultiIndex

In [None]:
import pandas as pd

# Part 1. Motivation and Showcase

![img](pandas.assets/1lZH0zLkDU01MbmLoq2B3SQ.png)

Here are a few examples of what Pandas can do for you that NumPy cannot (or can only do with significant effort).

**1. Sorting**
![img](pandas.assets/1DKVpqZ00lqWTIducEHbdNQ.png)

**2. Sorting by several columns**
![img](pandas.assets/1dlpYc3VwGuv-R0iTu4mPzw.png)

**3. Adding a column**
![img](pandas.assets/1sagPinoTw4nfgiKK97OL1A.png)

**4. Fast element search**
![img](pandas.assets/16bWbS6mCvT-uiJsLR5e6ag.png)

**5. Joins by column**
![img](pandas.assets/1EbVEDx9maySLpDsdND7b5A.png)

**6. Grouping by column**
![img](pandas.assets/1VeV4Jw2kfF2Nt89ftmxi2A.png)

**7. Pivot tables**
![img](pandas.assets/120nnKdjRK3wEzjOvvAjusA.png)

At this point you might wonder why anyone would use NumPy if Pandas is so good. NumPy is not better or worse; it just has different use cases:

- Random numbers (e.g., for testing)
- Linear algebra (e.g., for neural networks)
- Images and stacks of images (e.g., for CNNs)
- Differentiation, integration, trigonometry and other scientific stuff.

**Benchmark:** NumPy and Pandas were compared on a typical workload (5–100 columns; 10³–10⁸ rows; integers and floats).
The results for 1 row and for 100 million rows:
![img](pandas.assets/1xZD8Ky4Z2RM9Rax8Fdhfyg.png)


![img](pandas.assets/1ouJSVxgRrPJKt-c68W_2Hw.png)

# Part 2. Series and Index

![img](pandas.assets/1nYheMZJlvGo0UvIfcSRUrQ.png)

![img](pandas.assets/1zOZ6eZzgQ2gzidRLLo1f5A.png)

In [None]:
s = pd.Series([0.6, 0.7, 2.2, 10.5], index=["Athens", "Oslo", "Paris", "Bangkok"])
s

Every element can be addressed in two alternative ways
![img](pandas.assets/1yFgpYI_tprMerCrOLRr6HQ.png)
- by ‘label’ (=using the index)
- by ‘position’ (=not using the index):

In [None]:
display(s[1:3])
display(s["Oslo":"Paris"])

If the labels happen to be integers, `s[1:3]` becomes ambiguous.
![img](pandas.assets/1FlCqe7nGuygynyg0MfrczQ.png)

- by ‘label’ (=using the index=**loc**)
- by ‘position’ (=not using the index=**iloc**):

You can use a single or double colon with the familiar meaning of `start:stop:step`. As usual, a missing start (stop) means from the start (to the end) of the Series. The step argument lets you reference even rows with `s.iloc[::2]` and get elements in reverse order with `s['Paris':'Oslo':-1]`.

In [None]:
s.loc[::-1]
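
The difference between the two modes is easiest to see when the labels are integers. The following is a minimal sketch with made-up integer labels: `iloc` slices by position and excludes the stop, while `loc` slices by label and includes it.

```python
import pandas as pd

# Integer labels 1..4 that deliberately differ from the positions 0..3.
s = pd.Series([0.6, 0.7, 2.2, 10.5], index=[1, 2, 3, 4])

by_position = s.iloc[1:3]   # positions 1 and 2; the stop position is excluded
by_label = s.loc[1:3]       # labels 1, 2 and 3; the stop label is included
every_second = s.iloc[::2]  # positions 0 and 2

print(by_position.tolist())
print(by_label.tolist())
print(every_second.tolist())
```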

They also support boolean indexing

![img](pandas.assets/1gPF81TsMk8gbTvysONfJeA.png)

In [None]:
# !pip install pandas-illustrated

In [None]:
import pdi
pdi.patch_series_repr(footer=False)

In [None]:
s = pd.Series(["cat", "dog", "panda", "cat", "dragon"], index=list("abcde"), name="animal")
a = (s == "cat") | (s == "dog")
b = s.isin(["cat", "dog"])

pdi.sidebyside(a, b)

Series also support **fancy indexing** (indexing with an array of integers):
 
![img](pandas.assets/1aPEgTD8YUKDphJ-qC9e2gw.png)
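
As a small illustration (the variable names are only for this sketch), fancy indexing works with a list of positions through `iloc` and with a list of labels through `loc`; the result follows the order of the list you pass.

```python
import pandas as pd

s = pd.Series(["cat", "dog", "panda", "cat", "dragon"], index=list("abcde"))

picked_by_position = s.iloc[[0, 2, 4]]  # a list of integer positions
picked_by_label = s.loc[["e", "a"]]     # a list of labels, in the requested order

print(picked_by_position.tolist())
print(picked_by_label.tolist())
```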

## Index

When a Series (or DataFrame) is created without an index, its `Index` is initialized as a `RangeIndex` object.

In [None]:
import numpy as np

s = pd.Series(np.zeros(10**6))
display(s.index)
display(s.index.memory_usage())

In [None]:
s1 = s.drop(1)
display(s1.index)
display(s1.index.memory_usage())

In [None]:
s2 = s1.reset_index(drop=True)
display(s2.index)

## Finding element by value

![img](pandas.assets/1Pv9iM9jJUuPqT5HGFI3WIw.png)

In [None]:
s = pd.Series([4, 2, 4, 6], index=["cat", "penguin", "dog", "butterfly"])
np.where(s==4)[0]
s.index[np.where(s==4)[0]]

## Missing values

![img](pandas.assets/1w2sNUkxiK_F4Y-bBCcpp_g.png)

In [None]:
s = pd.Series([1., None, 3.])
pdi.sidebyside(s, s.isna())

s.isna().sum() # number of missing values (non-zero if there are any NaNs)

![img](pandas.assets/1WeVr1U1XGt157JPn3FF0fQ.png)

In [None]:
pdi.sidebyside(s, s.fillna(0), s.interpolate(), s.dropna())

On the other hand, you can keep using them. Most Pandas functions happily ignore the missing values:
![img](pandas.assets/11EzfnD-xr-0OknFkBOYxGQ.png)
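
A quick sketch of that behavior: reductions such as `sum` and `mean` skip NaNs by default, and passing `skipna=False` turns that off.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

total = s.sum()                     # NaN is skipped: 1.0 + 3.0
average = s.mean()                  # mean of the two present values
strict_total = s.sum(skipna=False)  # opting out makes the result NaN

print(total, average, strict_total)
```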

Arithmetic operations are aligned against the `index`:

![img](pandas.assets/1Bln2ayx6iO3sGYzfd1N18Q.png)

In [None]:
s1 = pd.Series([1,2,3], index=list("abc"))
s2 = pd.Series([1,2,3], index=list("bcd"))
pdi.sidebyside(s1, s2, s1+s2)
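
Labels present in only one operand produce NaN in the result. If that is not what you want, one option is the method form of the operator with `fill_value`, sketched here on the same two Series:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=list("abc"))
s2 = pd.Series([1, 2, 3], index=list("bcd"))

plain = s1 + s2                    # "a" and "d" have no partner and become NaN
filled = s1.add(s2, fill_value=0)  # a missing partner is treated as 0

print(plain)
print(filled)
```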

## Comparisons

In [None]:
a = pd.Series([1., None, 3.]) == pd.Series([1., None, 3.])
b = pd.Series([1, None, 3], dtype='Int64') == pd.Series([1, None, 3], dtype='Int64')
pdi.sidebyside(a, b)
print(np.all(a), np.all(b))

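To be compared properly, NaNs have to be replaced with a value that is guaranteed not to occur in the data (here, `np.inf`):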

In [None]:
s1 = pd.Series([1., None, 3.])
s2 = pd.Series([1., None, 3.])

np.all(s1.fillna(np.inf) == s2.fillna(np.inf))

Or, better yet, use a standard NumPy or Pandas comparison function:

In [None]:
np.array_equal(s1.values, s2.values, equal_nan=True)
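
On the Pandas side, a possible counterpart is `Series.equals`, which treats NaNs in the same positions as equal. Note that, unlike the NumPy call above, it also compares the index and dtype.

```python
import pandas as pd

s1 = pd.Series([1.0, None, 3.0])
s2 = pd.Series([1.0, None, 3.0])

elementwise_all = bool((s1 == s2).all())  # NaN == NaN is False, so this fails
same = s1.equals(s2)                      # NaNs in matching positions count as equal

print(elementwise_all, same)
```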

## Appends, inserts, deletions

![img](pandas.assets/1JHVUVZGTwMVfEn2_i-JOsg.png)

In [None]:
s = pd.Series(["cat", "dog", "horse"])
s1 = s.copy()
s1[1.5] = "panda"
s2 = s1.sort_index()
s3 = s2.reset_index(drop=True)
s4 = s3[s3 != "panda"]
pdi.sidebyside(s, s1, s2, s3, s4)

## Statistics

![img](pandas.assets/1nEHVgF88PYNnhK1q0YpY8A.png)

In [None]:
s = pd.Series([3, 7, 5], index=list("abc")) 
print(s.max(), s.mean(), s.median())
s.rolling(2).mean()


- `std`, sample standard deviation;
- `var`, unbiased variance;
- `sem`, unbiased standard error of the mean;
- `quantile`, sample quantile (`s.quantile(0.5) ≈ s.median()`);
- `mode`, the value(s) that appears most often;
- `nlargest` and `nsmallest`, by default, in order of appearance;
- `diff`, first discrete difference;
- `cumsum` and `cumprod`, cumulative sum, and product;
- `cummin` and `cummax`, cumulative minimum and maximum.
- `pct_change`, percent change between the current and previous element;
- `skew`, unbiased skewness (third moment);
- `kurt` or `kurtosis`, unbiased kurtosis (fourth moment);
- `cov`, `corr` and `autocorr`, covariance, correlation, and autocorrelation;
- [rolling](https://pandas.pydata.org/pandas-docs/stable/reference/window.html#rolling-window-functions), [weighted](https://pandas.pydata.org/pandas-docs/stable/reference/window.html#weighted-window-functions), and [exponentially weighted](https://pandas.pydata.org/pandas-docs/stable/reference/window.html#exponentially-weighted-window-functions) windows.
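
A few of the functions from this list, applied to the same three-element Series as above (a minimal sketch; the printed values are left for you to inspect):

```python
import pandas as pd

s = pd.Series([3, 7, 5], index=list("abc"))

variance = s.var()          # unbiased (divides by n - 1)
deviation = s.std()         # sample standard deviation
differences = s.diff()      # first element has no predecessor, so it is NaN
running_total = s.cumsum()  # cumulative sum
top_two = s.nlargest(2)     # the two largest values

print(variance, deviation)
print(differences.tolist(), running_total.tolist(), top_two.tolist())
```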

Since every element in a series can be accessed by either a label or a positional index, there’s a sister function for `argmin` (`argmax`) called `idxmin` (`idxmax`), which is shown in the image:

![img](pandas.assets/1qpCkFcrRKj8oN9qdmLqaAw.png)

In [None]:
print(s.idxmax(), s.argmax(), s.max())

## Duplicate data

![img](pandas.assets/1jlnjYL6OqoKzaNByBdkWnQ.png)

Missing values are treated as ordinary values, which may sometimes lead to surprising results.
There is also a family of monotonic functions with self-describing names:

- `s.is_monotonic_increasing`,
- `s.is_monotonic_decreasing`,
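- `s._strict_monotonic_increasing`,
- `s._strict_monotonic_decreasing` (private names as given in the source; they may not exist in your Pandas version, in which case `s.is_monotonic_increasing and s.is_unique` is a public check for the strictly increasing case), and, quite unexpectedly,
- `s.is_monotonic`: this is a synonym for `s.is_monotonic_increasing` and returns `False` for monotonically decreasing series! It was deprecated in Pandas 1.5 and removed in 2.0.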

In [None]:
s1 = pd.Series(np.arange(5), index=list("abcde")) 
s2 = pd.Series([3, 7, 5], index=list("abc"))

print(s1.is_monotonic_increasing,
s1.is_monotonic_decreasing,
s2.is_monotonic_increasing)
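
Going back to duplicates, here is a short sketch of how missing values behave as ordinary values (the data is made up for illustration): the second NaN is flagged as a duplicate of the first one, and `drop_duplicates` keeps exactly one NaN.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 1.0, np.nan, np.nan])

unique_flag = s.is_unique         # False: both 1.0 and NaN are repeated
dup_mask = s.duplicated()         # True for every repeated occurrence
deduplicated = s.drop_duplicates()

print(unique_flag)
print(dup_mask.tolist())
print(deduplicated.tolist())
```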

## Strings and regular expressions

![img](pandas.assets/1IlVqCy1NEWM7h5BdWsbBcA.png)
When such an operation returns multiple values, you have several options for how to use them:
![img](pandas.assets/1t-Z8iW3HLbWlzkOkcy2NRg.png)

In [None]:
s = pd.Series(["e2-e4", "O-O-O", "d8Q"])
s.str.split("-", expand=True)

If you know regular expressions, Pandas has vectorized versions of the common operations with them, too:
![img](pandas.assets/1w1EXCEbBI9rzh3HJOZpzmw.png)
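
For instance, on the same chess-move Series (the patterns below are only illustrative), `str.contains`, `str.extract` and `str.findall` all accept regular expressions:

```python
import pandas as pd

s = pd.Series(["e2-e4", "O-O-O", "d8Q"])

has_digit = s.str.contains(r"\d")                          # boolean mask
first_square = s.str.extract(r"([a-h]\d)", expand=False)   # first match or NaN
all_squares = s.str.findall(r"[a-h]\d")                    # list of all matches

print(has_digit.tolist())
print(first_square.tolist())
print(all_squares.tolist())
```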

## Group by

![img](pandas.assets/1P6gMsfRIAUFmdZogIKIExA.png)

All operations exclude NaNs

![img](pandas.assets/1EVkkZyRXzZCCmQ-3jvM8dQ.png)

In addition to those aggregate functions, you can access particular elements based on their position or relative value within a group. Here’s what that looks like:

![img](pandas.assets/1mx1U5kHBwwBDeiCF-7U8zQ.png)
You can also calculate several functions in one call with `g.agg(['min', 'max'])` or display a whole bunch of stats functions at once with `g.describe()`.

In [None]:
s = pd.Series([1, 2, 10, 11, 15, 27])
g = s.groupby(s // 10)

pdi.sidebyside(g.agg(["min", "mean"]), g.describe())

![img](pandas.assets/1gnIPGT6TRq7R4DT6lhpGdg.png)

# Part 3. DataFrames

![img](pandas.assets/194hoswsoooSH_wVSEU0ilw.png)

## Reading and writing CSV files

![img](pandas.assets/1_ZYiMs0TJJmEmqCscMIwbA.png)

Since CSV doesn’t have a strict specification, sometimes it takes a bit of trial and error to read it correctly. What is cool about `read_csv` is that it automatically detects a lot of things, including:

- column names and types,
- representation of booleans,
- representation of missing values, etc.

![img](pandas.assets/1WvWU4gSz1c5VA5bQdYM4_Q.png)
![img](pandas.assets/1TwXX2K1Oj6u5lY4h_aP_vQ.png)

It is a good idea to set one or several columns as an index. The following image shows this process:

![img](pandas.assets/18y5tQge0RAohOcmzgIRHcQ.png)

`Index` has many uses in Pandas:

- it makes lookups by indexed column(s) faster;
- arithmetic operations, stacking, joining are aligned by index; etc.

In [None]:
df1 = pd.read_csv("data/courses.csv")
df2 = pd.read_csv("data/courses.csv", index_col="程")
display(df1.iloc[-4:])
display(df2.iloc[:4])
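
If `data/courses.csv` is not at hand, the same ideas can be tried on an in-memory CSV. This is a sketch with illustrative data (Vienna's area is left blank on purpose): `read_csv` infers the column types, turns the empty field into NaN, and `index_col` moves a column into the index.

```python
import io

import pandas as pd

# A tiny CSV held in a string instead of a file; the blank field is deliberate.
csv_text = (
    "city,population,area\n"
    "Oslo,698660,480.8\n"
    "Vienna,1911191,\n"
    "Tokyo,14043239,2194.1\n"
)

df = pd.read_csv(io.StringIO(csv_text), index_col="city")

print(df.dtypes)             # integer and float columns are detected
print(df.loc["Vienna"])      # the blank field was read as NaN
```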

## Building a DataFrame

![img](pandas.assets/1uY9uygYVTsbOeZMkyvLMKA.png)

Column names and row labels can be specified using the `columns` argument and the `index` argument:

![img](pandas.assets/1ecpiKwydlLWNJiiDwDx45A.png)

The next option is to construct a DataFrame from a dict of NumPy vectors or a 2D NumPy array:

![img](pandas.assets/1fBawUWTps5MmwwJtKgbGdw.png)

In [None]:
d = np.array([[698660, 480.8],
             [1911191, 414.8],
             [14043239, 2194.1]])

df = pd.DataFrame(d, columns=["population", "area"], 
                  index=["Oslo", "Vienna", "Tokyo"])
df.index.name = "city_name"
df
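
The cell above uses a 2D array, so both columns end up as floats. A sketch of the dict-of-vectors alternative with the same numbers shows that each column then keeps its own dtype:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "population": np.array([698660, 1911191, 14043239]),  # stays integer
        "area": np.array([480.8, 414.8, 2194.1]),             # stays float
    },
    index=["Oslo", "Vienna", "Tokyo"],
)
df.index.name = "city_name"

print(df.dtypes)
```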

## Basic operations with DataFrames

![img](pandas.assets/10PxEM1tcVdK-Cc71CgHWKA.png)

In [None]:
df["density"] = df["population"] / df["area"]
df

## Indexing DataFrames

![img](pandas.assets/1i2l47j_-P6qRAXEu6ClVAg.png)

DataFrames, just like Series, have two alternative indexing modes: `loc` for indexing by labels and `iloc` for indexing by positional index.

![img](pandas.assets/1XHIJAm2Zej0W38bGRgwbTw.png)

In [None]:
df = pd.DataFrame(np.arange(1, 13).reshape(3, -1), 
                 index=list("abc"), columns=list("ABCD"))
display(df)
df.loc[:, "B"] = 10
display(df)

In [None]:
df.loc[["b", "c"], "A":"C"] = 99
df

![img](pandas.assets/14slwo8GXp5wB6HyFoXbLvg.png)

In [None]:
d = np.array([["Oslo", 698660, 480.8],
             ["Vienna", 1911191, 414.8],
             ["Tokyo", 14043239, 2194.1]])

df = pd.DataFrame(d, columns=["name", "population", "area"])

df.loc[df["population"].astype(int) > 10 ** 6, "name"]

When using several conditions, they must be parenthesized, as you can see below:

![img](pandas.assets/1wDUggm9J1vpZwS4rLlmlYg.png)
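
The parentheses are needed because `&` and `|` bind more tightly than comparison operators in Python. A minimal sketch with the city data from above (the thresholds are arbitrary):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Oslo", "Vienna", "Tokyo"],
        "population": [698660, 1911191, 14043239],
        "area": [480.8, 414.8, 2194.1],
    }
)

# Each comparison is wrapped in parentheses before combining with &.
mask = (df["population"] > 10**6) & (df["area"] < 1000)
big_and_compact = df.loc[mask, "name"]

print(big_and_compact.tolist())
```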

When you expect a single value to be returned, you need special care.

![img](pandas.assets/1yGIyTCHcJ2AjeFXKnn4F-Q.png)
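
One way to read this, sketched under the assumption that the lookup is a boolean filter: `loc` with a condition returns a Series even when only one row matches, so the scalar has to be extracted explicitly. `item()` raises `ValueError` unless there is exactly one element, while `iloc[0]` only fails when nothing was found.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Oslo", "Vienna", "Tokyo"],
        "population": [698660, 1911191, 14043239],
    }
)

match = df.loc[df["name"] == "Vienna", "population"]  # a one-element Series

exactly_one = match.item()   # ValueError if there are zero or several matches
first_found = match.iloc[0]  # IndexError only if there are no matches

print(type(match).__name__, exactly_one, first_found)
```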

## DataFrame arithmetic

All arithmetic operations are aligned against the row and column labels:
![img](pandas.assets/1M3HuRKFHEi8Ao-NxcOBmHA.png)

In mixed operations between DataFrames and Series, the Series (God knows why) behaves (and broadcasts) like a row-vector and is aligned accordingly:
![img](pandas.assets/1SqbcZB8UygeM8XFuZE2p5g.png)

Probably to keep in line with lists and 1D NumPy vectors (which are not aligned by labels and are expected to be sized as if the DataFrame was a simple 2D NumPy array):

![img](pandas.assets/1a7VaVbk1R3z05jCgL4R-jQ.png)

So, in the unlucky (and, by coincidence, the most usual!) case of dividing a dataframe by a column-vector series, you have to use methods instead of the operators, as you can see below:

![img](pandas.assets/1TjSonUYkRaC7b6-BExLVIA.png)

![img](pandas.assets/1EyEn4Hp0dSq3jRMGwisiXg.png)
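
A small sketch of the difference, using made-up numbers: the `/` operator aligns the Series labels with the column names (and finds no matches here), whereas `div(..., axis=0)` aligns them with the row labels.

```python
import pandas as pd

df = pd.DataFrame({"A": [10, 20], "B": [30, 40]}, index=["a", "b"])
col = pd.Series([10, 20], index=["a", "b"])

by_operator = df / col        # labels "a", "b" are matched against columns "A", "B"
by_rows = df.div(col, axis=0) # labels "a", "b" are matched against the row index

print(by_operator)
print(by_rows)
```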

## Vertical stacking

This is probably the simplest way to combine two or more dataframes into one: you take the rows from the first one and append the rows from the second one to the bottom. To make it work, those two dataframes need to have (roughly) the same columns. This is similar to `vstack` in NumPy, as you can see in the image:

![img](pandas.assets/1U0MaT2LtC9ObBOrlLXyL0A.png)

In [None]:
df1 = pd.read_csv("data/courses.csv").iloc[:3]
df2 = pd.read_csv("data/courses.csv").iloc[-3:]
pdi.sidebyside(pd.concat([df1, df2]), 
               pd.concat([df1, df2], ignore_index=True))

![img](pandas.assets/1SyW83FIGHBKgMGyNmTJ0Ug.png)

## Horizontal stacking

![img](pandas.assets/1lS1W9lUTqAu3F6vjCdqf7Q.png)
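
As a sketch (the two small frames are illustrative), horizontal stacking can be done with `pd.concat(..., axis=1)`. Rows are aligned by index label; by default all labels are kept, and `join="inner"` keeps only the shared ones.

```python
import pandas as pd

df1 = pd.DataFrame({"population": [698660, 1911191]}, index=["Oslo", "Vienna"])
df2 = pd.DataFrame({"area": [414.8, 2194.1]}, index=["Vienna", "Tokyo"])

wide = pd.concat([df1, df2], axis=1)                   # union of row labels
shared = pd.concat([df1, df2], axis=1, join="inner")   # intersection of row labels

print(wide)
print(shared)
```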

## 1:1 Relationship joins

![img](pandas.assets/1mEANaOr9HRR69IuMGn1esg.png)

If the column is already in the index, you can use `join`

![img](pandas.assets/1qhqIn9Aw51jCwSA4wrH4Lw.png)

If the column you want to merge on is not in the index, use `merge`.

![img](pandas.assets/1BQLUmwoXCosorw-1eWT-7w.png)
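
Both variants side by side, as a minimal sketch with two illustrative frames that share a `name` column: `merge` works on ordinary columns, while `join` expects the key to be in the index.

```python
import pandas as pd

cities = pd.DataFrame({"name": ["Oslo", "Vienna"], "population": [698660, 1911191]})
areas = pd.DataFrame({"name": ["Oslo", "Vienna"], "area": [480.8, 414.8]})

# Key is a regular column: merge on it.
merged = cities.merge(areas, on="name")

# Key is in the index: join aligns the two indexes.
joined = cities.set_index("name").join(areas.set_index("name"))

print(merged)
print(joined)
```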

## 1:n Relationship joins

![img](pandas.assets/17Q-lAZNYhTRLxMEcG6wdUQ.png)

Now, if the column to merge on is already in the index of the right DataFrame, use `join` (or `merge` with `right_index=True`, which is exactly the same thing):

![img](pandas.assets/1lnCpYttPfSsp2KImx1PZtA.png)

`join()` performs a left outer join by default.

Sometimes, joined dataframes have columns with the same name.

![img](pandas.assets/1MquOSCrtVO9Vj18-V91hSQ.png)
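
A sketch of how the name clash can be resolved, using made-up frames with a shared `value` column: `merge` takes a `suffixes` pair, while `join` takes `lsuffix` and `rsuffix` (and raises an error on overlapping names if neither is given).

```python
import pandas as pd

left = pd.DataFrame({"value": [1, 2]}, index=["a", "b"])
right = pd.DataFrame({"value": [10, 20]}, index=["a", "b"])

joined = left.join(right, lsuffix="_left", rsuffix="_right")
merged = left.merge(
    right, left_index=True, right_index=True, suffixes=("_left", "_right")
)

print(joined)
print(merged)
```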

## Multiple joins

![img](pandas.assets/173WqDpx0NuSHgNHZnaNLLQ.png)

## Inserts and deletes

![img](pandas.assets/17XRaXsl0ytSI8IPDXauUfg.png)

In [None]:
df = pd.DataFrame(np.arange(1, 7).reshape(2, -1),
                 columns=list("ABC"),
                 index=list("ab"))
df1 = pd.DataFrame(np.insert(df.values, 1, values=[7, 8, 9], axis=0))
df.insert(1, "D", [7, 8])
pdi.sidebyside(df, df1)

Deleting columns is usually worry-free, except that `del df['D']` works while `del df.D` doesn’t (limitation on the Python level).
![img](pandas.assets/146V9fMGZv4UrCT3gW2-m7Q.png)

In [None]:
df = pd.DataFrame(np.arange(1, 10).reshape(3, -1),
                 columns=list("ABC"),
                 index=list("abc"))
df1 = df.drop(index="b", inplace=False)
df2 = df.drop(columns="B", inplace=False)
pdi.sidebyside(df, df1, df2)

## Group by

![img](pandas.assets/1J7n3A7uLJ55L0CFGHF2EhA.png)

![img](pandas.assets/1Lbqy_M1eTJWuoarH814Qyg.png)
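
For DataFrames the grouping key is usually a column. The following sketch uses hypothetical product data (column names and values are invented for illustration) and shows a single aggregate as well as named aggregation over several columns.

```python
import pandas as pd

# Hypothetical data, only for illustration.
df = pd.DataFrame(
    {
        "product": ["apple", "pear", "apple", "pear"],
        "qty": [1, 2, 3, 4],
        "price": [0.5, 0.8, 0.6, 0.7],
    }
)

g = df.groupby("product")

totals = g["qty"].sum()  # one aggregate for one column
summary = g.agg(         # several aggregates with explicit result names
    total_qty=("qty", "sum"),
    max_price=("price", "max"),
)

print(totals)
print(summary)
```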