# How Pandas Works Under the Hood

As with any programming language, it’s important to understand what is going on underneath, because that understanding helps you write more explicit, simpler, more performant, and more correct code.

Pandas is a wrapper around NumPy and NumPy is a wrapper around C; thus, pandas gets its performance from running things in C and not in Python. This concept is fundamental to everything you do in pandas. When you are in C, you are fast, and when you are in Python, you are slow.
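The gap between "in C" and "in Python" can be seen with a rough timing sketch. The helper name `python_sum` and the array size are illustrative assumptions, and the exact timings will vary by machine; the only point is that both approaches compute the same total while the vectorized call avoids the Python-level loop.

```python
import timeit

import numpy as np
import pandas as pd

values = pd.Series(np.arange(100_000, dtype="int64"))


def python_sum(series):
    """Sum the values one at a time in a Python-level loop."""
    total = 0
    for value in series:
        total += value
    return total


# Both approaches produce the same answer...
loop_total = python_sum(values)
vectorized_total = values.sum()

# ...but the loop runs in the interpreter while .sum() runs in compiled code.
loop_time = timeit.timeit(lambda: python_sum(values), number=5)
vectorized_time = timeit.timeit(lambda: values.sum(), number=5)
print(f"loop: {loop_time:.4f}s, vectorized: {vectorized_time:.4f}s")
```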

The same requirements present for working with NumPy arrays hold true when working with pandas DataFrames—namely, the Python code must be translatable to C code; this includes the types that hold the data and the operations performed on the data.
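As a small illustration of the type requirement, the sketch below (with made-up values) shows that a column of plain integers gets a C-compatible dtype, while a column that mixes integers and strings falls back to the generic `object` dtype, which holds references to Python objects.

```python
import pandas as pd

numeric = pd.Series([1, 2, 3])     # homogeneous integers
mixed = pd.Series([1, "two", 3])   # integers mixed with a string

print(numeric.dtype)  # int64
print(mixed.dtype)    # object
```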

Here is a table mapping pandas types to NumPy types. Note that `datetime`s and `timedelta`s have no native C counterpart, because C does not have a datetime data structure. NumPy instead stores `datetime64` and `timedelta64` values as 64-bit integers, and in cases where operations must be made on `datetime` data, it can be more performant to convert the `datetime`s to an `int` type of seconds since the epoch.

|pandas type | NumPy type|
|:--|:--|
|`object` | `string_`, `unicode_`|
|`int64` | `int_`, `int8`, `int16`, `int32`, `int64`, `uint8`, `uint16`, `uint32`, `uint64`|
|`float64` | `float_`, `float16`, `float32`, `float64`|
|`bool` | `bool_`|
|`datetime64[ns]` | `datetime64[ns]`|
|`timedelta64[ns]` | `timedelta64[ns]`|
|`category` | N/A|
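The conversion of `datetime`s to integer seconds since the epoch can be sketched as follows. The sample dates are arbitrary, and subtracting the epoch and floor-dividing by a one-second `Timedelta` is only one of several ways to do this.

```python
import pandas as pd

stamps = pd.Series(pd.to_datetime(["1970-01-01", "1970-01-02", "2000-01-01"]))

# Subtract the epoch, then floor-divide by one second to get integer seconds.
epoch = pd.Timestamp("1970-01-01")
epoch_seconds = (stamps - epoch) // pd.Timedelta(seconds=1)

print(epoch_seconds.tolist())  # [0, 86400, 946684800]
```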

Note that `category` is also not translatable into C. `category` is similar to a `tuple` in that it is intended to hold a collection of categorical variables, meaning metadata with a fixed, unique set of values. Because it’s not translatable into C, it should never be used to hold data that needs to be analyzed. Its main advantage is its ability to sort values in a custom order simply and efficiently. Underneath, it looks like an array of integer indexes, where each index corresponds to a unique value in an array of categories. The documentation claims that it can result in huge memory savings when using string categories. Of course, Python already has a built-in string cache that does this automatically for certain strings, so this would really only make a difference if the strings contained characters other than alphanumerics and underscores.
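The custom sort order mentioned above can be sketched with an ordered categorical. The size labels here are invented for illustration; the point is that sorting follows the declared category order, not alphabetical order.

```python
import pandas as pd

# Declare the categories in the order they should sort.
size_order = pd.CategoricalDtype(["small", "medium", "large"], ordered=True)
sizes = pd.Series(["medium", "small", "large", "small"], dtype=size_order)

sorted_sizes = sizes.sort_values()
print(sorted_sizes.tolist())  # ['small', 'small', 'medium', 'large']
```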

Below is an example of a `category` and how pandas displays it. Note that, underneath, it uses integers to represent the values, and each integer maps to a position in the array of categories. This is a common method of conserving memory in pandas. We’ll run into this again later when we look at multi-indexing.

In [30]:
import pandas as pd
produce = pd.Series(
     ["apple", "banana", "carrot", "apple"], dtype="category"
)

In [31]:
produce

0     apple
1    banana
2    carrot
3     apple
dtype: category
Categories (3, object): ['apple', 'banana', 'carrot']
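The integer representation can be inspected through the `.cat` accessor. This sketch rebuilds the same `produce` Series so that it runs on its own.

```python
import pandas as pd

produce = pd.Series(
    ["apple", "banana", "carrot", "apple"], dtype="category"
)

# One small integer per row, pointing into the array of categories.
codes = produce.cat.codes
# The unique values, stored once.
categories = produce.cat.categories

print(codes.tolist())       # [0, 1, 2, 0]
print(categories.tolist())  # ['apple', 'banana', 'carrot']
```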

Operations must also be translatable into C in order to take advantage of NumPy’s performance optimizations. This means custom functions like the one below will not be performant, because they run row by row in Python and not in C. We’ll dig more into this example, and the `apply` function specifically, in later sections.

In [35]:
def grade(values):
    if 70 <= values["score"] < 80:
        values["score"] = "C"
    elif 80 <= values["score"] < 90:
        values["score"] = "B"
    elif 90 <= values["score"]:
        values["score"] = "A"
    else:
        values["score"] = "F"
    return values

In [36]:
scores = pd.DataFrame({"score": [89, 70, 71, 65, 30, 93, 100, 75]})

In [38]:
scores.apply(grade, axis=1)

  score
0     B
1     C
2     C
3     F
4     F
5     A
6     A
7     C
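For contrast, here is one possible vectorized version of the same grading logic. It assumes `np.select`, which is a substitute technique and not necessarily the approach used in later sections. The comparisons and the selection run over whole arrays, without calling a Python function per row.

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({"score": [89, 70, 71, 65, 30, 93, 100, 75]})
score = scores["score"].to_numpy()

# Conditions are checked in order, and the first one that matches wins.
conditions = [score >= 90, score >= 80, score >= 70]
letters = ["A", "B", "C"]

graded = pd.Series(
    np.select(conditions, letters, default="F"),
    index=scores.index,
    name="score",
)
print(graded.tolist())  # ['B', 'C', 'C', 'F', 'F', 'A', 'A', 'C']
```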
