In [460]:
import pandas as pd

# 5.1 Introduction to pandas Data Structures

### Series

In [461]:
obj = pd.Series([4, 7, -5, 3])

In [462]:
obj

0    4
1    7
2   -5
3    3
dtype: int64

Getting the array representation and the index

In [463]:
obj.array


<PandasArray>
[4, 7, -5, 3]
Length: 4, dtype: int64

In [464]:
obj.index


RangeIndex(start=0, stop=4, step=1)

A Series with an index identifying each data point with a label.

In [465]:
obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])
obj2

d    4
b    7
a   -5
c    3
dtype: int64

Notice that `index` returns an Index object holding the labels, not an iterator

In [466]:
obj2.index

Index(['d', 'b', 'a', 'c'], dtype='object')

So I can pull values out by label, like it's a dict

In [467]:
obj2["a"]

-5

I can easily update values too

In [468]:
obj2["d"] = 6

A list of indices

In [469]:
obj2[["c", "a", "d"]]

c    3
a   -5
d    6
dtype: int64

You can also do filtering.

In [470]:
obj2[obj2 > 0]

d    6
b    7
c    3
dtype: int64

Even Math

In [471]:
obj2 * 2

d    12
b    14
a   -10
c     6
dtype: int64

Here we import NumPy and give it the alias `np`. NumPy functions like `np.exp` work on a Series and keep the index labels attached.

In [472]:
import numpy as np

In [473]:
np.exp(obj2)

d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

Hey look, it's our good friend, the `in` operator. It checks the index labels, the same way it checks the keys of a dict.

In [474]:
"b" in obj2

True

In [475]:
"e" in obj2

False

Let's make a new Series using a dict

In [476]:
sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}

obj3 = pd.Series(sdata)

obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

We can also reverse this by using to_dict()

In [477]:
obj3.to_dict()

{'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

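Passing an index lets us reorder the keys.
- Also notice that the dtype becomes float64: "California" has no entry in `sdata`, so it gets NaN, and NaN is a float value that an int64 Series cannot hold. Utah is dropped because it is not in `states`.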

In [478]:
states = ["California", "Ohio", "Oregon", "Texas"]
obj4 = pd.Series(sdata, index=states)
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64
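To see the dtype change on its own, here is a minimal sketch with a two-entry dict. The names `ints`, `with_gap`, and `nullable` are mine, not from the notes above, and the last step assumes you would rather keep integers, in which case the nullable `Int64` dtype is one option:

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000}

ints = pd.Series(sdata)                                    # every label has a value
with_gap = pd.Series(sdata, index=["California", "Ohio"])  # "California" is missing

print(ints.dtype)       # int64
print(with_gap.dtype)   # float64, because NaN is a float

# Optional: the nullable integer dtype stores the gap as <NA> instead of NaN
nullable = with_gap.astype("Int64")
print(nullable)
```

The float conversion is harmless for arithmetic, but it does change how the values print (35000.0 instead of 35000).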

`isna` is the pandas name for an "is null" check (`pd.isnull` is an alias for it)

In [479]:
pd.isna(obj4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

`notna` is the opposite, basically "is not null"

In [480]:
pd.notna(obj4)

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

Series also have these instance methods, so we can shorten this to:

In [481]:
obj4.isna()

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

pandas automatically aligns Series by index label in arithmetic operations

In [482]:
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

In [483]:
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

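Notice how Utah comes out as NaN: it is missing from `obj4`, so there is nothing to add to its 5000. California is NaN for the mirror-image reason, since it is NaN in `obj4` and missing from `obj3`.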

In [484]:
obj3 + obj4

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64

You can also give the Series and its index a `name`

In [485]:
obj4.name = "population"
obj4.index.name = "state"
obj4

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

You can alter a Series's index in place by assignment

In [486]:
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [487]:
obj.index = ["Bob", "Steve", "Jeff", "Ryan"]
obj

Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64

# DataFrame

DataFrames have both a row index and a column index; just think of one as a dict of Series all sharing the same index.

In [488]:
data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
frame

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2



"Hey that's pretty good"


`head` selects the first five rows by default


In [489]:
frame.head()

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9


And `tail` does the opposite, returning the last five rows

In [490]:
frame.tail()

    state  year  pop
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2


You can also specify the column order with `columns`.

In [491]:
pd.DataFrame(data, columns=["year", "state", "pop"])

   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
5  2003  Nevada  3.2


And if that column doesn't exist? It still appears, just filled with missing values.

In [492]:
frame2 = pd.DataFrame(data, columns=["year", "state", "pop", "debt"])
frame2

   year   state  pop debt
0  2000    Ohio  1.5  NaN
1  2001    Ohio  1.7  NaN
2  2002    Ohio  3.6  NaN
3  2001  Nevada  2.4  NaN
4  2002  Nevada  2.9  NaN
5  2003  Nevada  3.2  NaN


In [493]:
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

A column in a DataFrame can be retrieved as a Series by dict-like notation or dot attribute notation.

In [494]:
frame2["state"]

0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
5    Nevada
Name: state, dtype: object

In [495]:
frame2.year

0    2000
1    2001
2    2002
3    2001
4    2002
5    2003
Name: year, dtype: int64

Rows are retrieved with `loc` (by label) or `iloc` (by integer position)

In [496]:
frame2.loc[1]

year     2001
state    Ohio
pop       1.7
debt      NaN
Name: 1, dtype: object

In [497]:
frame2.iloc[2]

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: 2, dtype: object

You can modify an entire column at once by assigning a scalar value.

In [498]:
frame2["debt"] = 16.5
frame2

   year   state  pop  debt
0  2000    Ohio  1.5  16.5
1  2001    Ohio  1.7  16.5
2  2002    Ohio  3.6  16.5
3  2001  Nevada  2.4  16.5
4  2002  Nevada  2.9  16.5
5  2003  Nevada  3.2  16.5


Or you can give it an array of values.

In [499]:
frame2["debt"] = np.arange(6.)
frame2

   year   state  pop  debt
0  2000    Ohio  1.5   0.0
1  2001    Ohio  1.7   1.0
2  2002    Ohio  3.6   2.0
3  2001  Nevada  2.4   3.0
4  2002  Nevada  2.9   4.0
5  2003  Nevada  3.2   5.0


Do keep in mind that when assigning lists or arrays to a column the length must match the length of the DataFrame.

If you assign a Series, its labels will be realigned to the DataFrame's index, inserting missing values at any index labels not present:

In [500]:
val = pd.Series([-1.2, -1.5, -1.7], index=["two", "four", "five"])
frame2["debt"] = val
frame2

   year   state  pop  debt
0  2000    Ohio  1.5   NaN
1  2001    Ohio  1.7   NaN
2  2002    Ohio  3.6   NaN
3  2001  Nevada  2.4   NaN
4  2002  Nevada  2.9   NaN
5  2003  Nevada  3.2   NaN
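In the cell above every `debt` value ends up missing, because the labels `two`, `four`, and `five` don't match anything in `frame2`'s integer index. A small sketch with made-up data (`frame` and `val` here are illustrative, not the objects from the cell) shows what the alignment looks like when some labels do match:

```python
import pandas as pd

frame = pd.DataFrame({"pop": [1.5, 1.7, 3.6]})   # default index: 0, 1, 2
val = pd.Series([-1.2, -1.5], index=[0, 2])      # labels that exist in frame

frame["debt"] = val   # aligned by label; row 1 has no match
print(frame)
```

Rows 0 and 2 pick up their values and row 1 is left as NaN, which is the "inserting missing values" part of the rule.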


Assigning to a column that doesn't exist creates a new column. Dot notation (like `frame2.eastern`) CANNOT create new columns.

The 'del' keyword will delete columns like with a dictionary. As an example, I first add a new column of Boolean values where the state column equals "Ohio":

In [501]:
frame2["eastern"] = frame2["state"] == "Ohio"
frame2

   year   state  pop  debt  eastern
0  2000    Ohio  1.5   NaN     True
1  2001    Ohio  1.7   NaN     True
2  2002    Ohio  3.6   NaN     True
3  2001  Nevada  2.4   NaN    False
4  2002  Nevada  2.9   NaN    False
5  2003  Nevada  3.2   NaN    False


In [502]:
del frame2["eastern"]
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

- The column returned from indexing a DataFrame is a view on the underlying data, not a copy. Thus, any in-place modifications to the Series will be reflected in the DataFrame. The column can be explicitly copied with the Series’s copy method.
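A minimal sketch of the explicit copy, using a throwaway frame. Whether plain column selection hands back a view can depend on the pandas version and its copy-on-write setting, so this only shows the `copy` side, which I would expect to behave the same everywhere:

```python
import pandas as pd

frame = pd.DataFrame({"year": [2000, 2001], "pop": [1.5, 1.7]})

pop_copy = frame["pop"].copy()   # independent copy of the column
pop_copy.loc[0] = 99.0           # change the copy only

print(pop_copy.tolist())         # [99.0, 1.7]
print(frame["pop"].tolist())     # [1.5, 1.7], the frame is untouched
```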

Nested dicts are common too. If a nested dict is passed to DataFrame, pandas will interpret the outer dict keys as the columns and the inner keys as the row indices.

In [503]:
populations = {"Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6},
               "Nevada": {2001: 2.4, 2002: 2.9}}
frame3 = pd.DataFrame(populations)
frame3

      Ohio  Nevada
2000   1.5     NaN
2001   1.7     2.4
2002   3.6     2.9


You can transpose the DataFrame (swap rows and columns) with similar syntax to a NumPy array:

- Note that transposing discards the column data types if the columns do not all have the same data type, so transposing and then transposing back may lose the previous type information. The columns become arrays of pure Python objects in this case.



In [504]:
frame3.T

        2000  2001  2002
Ohio     1.5   1.7   3.6
Nevada   NaN   2.4   2.9
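`frame3` is all floats, so it survives a transpose unharmed. To see the dtype loss the note warns about, here is a sketch with a deliberately mixed frame (`mixed` and `round_trip` are names I made up for this):

```python
import pandas as pd

mixed = pd.DataFrame({"year": [2000, 2001], "state": ["Ohio", "Nevada"]})
round_trip = mixed.T.T   # transpose, then transpose back

print(mixed.dtypes)        # year starts out as an integer column
print(round_trip.dtypes)   # after the round trip the columns are object
```

The values are still there after the round trip; only the type information is gone.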


The keys in the inner dictionaries are combined to form the index in the result. But you can give it an explicit index.

In [505]:
pd.DataFrame(populations, index=[2001, 2002, 2003])

      Ohio  Nevada
2001   1.7     2.4
2002   3.6     2.9
2003   NaN     NaN


Dictionaries of Series are treated in much the same way:

In [506]:
pdata = {"Ohio": frame3["Ohio"][:-1],
        "Nevada": frame3["Nevada"][:2]}
pd.DataFrame(pdata)

      Ohio  Nevada
2000   1.5     NaN
2001   1.7     2.4


If a DataFrame’s `index` and `columns` have their `name` attributes set, these will also be displayed:

In [507]:
frame3.index.name = "year"
frame3.columns.name = "state"
frame3


state  Ohio  Nevada
year
2000    1.5     NaN
2001    1.7     2.4
2002    3.6     2.9


A DataFrame does NOT have its own `name` attribute. To get the data out, use `to_numpy`, which returns it as a two-dimensional ndarray:


In [508]:
frame3.to_numpy()

array([[1.5, nan],
       [1.7, 2.4],
       [3.6, 2.9]])

In [509]:
frame2.to_numpy()

array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, nan],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, nan],
       [2002, 'Nevada', 2.9, nan],
       [2003, 'Nevada', 3.2, nan]], dtype=object)

# Index Objects
pandas's Index objects are responsible for holding the axis labels (this includes a DataFrame's column names) and other metadata. This is what that `.index` thing is all about.

In [510]:
obj = pd.Series(np.arange(3), index=["a", "b", "c"])
index = obj.index
index

Index(['a', 'b', 'c'], dtype='object')

In [511]:
index[1:]

Index(['b', 'c'], dtype='object')

Index objects are immutable and can't be modified by the user.

This makes it safer to share Index objects among data structures.

In [512]:
index[1] = "d"

TypeError: Index does not support mutable operations
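Since an Index can't be edited in place, "changing" a label really means building a new Index. A minimal sketch (the list comprehension is just one way to do it, and `renamed` and `raised` are illustrative names):

```python
import pandas as pd

index = pd.Index(["a", "b", "c"])

raised = False
try:
    index[1] = "d"            # in-place assignment is rejected
except TypeError as exc:
    raised = True
    print(f"TypeError: {exc}")

# Build a new Index instead of mutating the old one
renamed = pd.Index(["d" if label == "b" else label for label in index])
print(list(renamed))          # ['a', 'd', 'c']
print(list(index))            # ['a', 'b', 'c'], the original is unchanged
```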

In [None]:
labels = pd.Index(np.arange(3))
labels

In [None]:
obj2 = pd.Series([1.5, -2.5, 0], index=labels)
obj2

In [None]:
obj2.index is labels

- Some users will not often take advantage of the capabilities provided by an Index, but because some operations will yield results containing indexed data, it's important to understand how they work.

In addition to being array-like, an Index also behaves like a fixed-size set:



In [None]:
frame3

In [None]:
frame3.columns

In [None]:
"Ohio" in frame3.columns

In [None]:
2003 in frame3.index

Unlike Python sets, a pandas Index can contain duplicate labels.

Selections with duplicate labels will select all occurrences of that label.

In [None]:
pd.Index(["foo", "foo", "bar", "bar"])

# Essential Functionality

It's important to note the ways you can interact with the data as well. For example, Reindexing.

`reindex` creates a new object with the values rearranged to align with the new index.

In [None]:
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=["d", "b", "a", "c"])
obj

Now that we have this data, let's `reindex` it.

In [None]:
obj2 = obj.reindex(["a", "b", "c", "d", "e"])
obj2

For ordered data like time series, you may want to do some interpolation or filling of values when reindexing. The `method` option allows us to do this, using a method such as `ffill`, which forward-fills the values:



In [None]:
obj3 = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])
obj3

In [None]:
obj3.reindex(np.arange(6), method="ffill")
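The cell above was never run, so here is a sketch of what forward-filling does, compared with a plain `reindex`. The variable names (`colors`, `plain`, `filled`) are mine:

```python
import numpy as np
import pandas as pd

colors = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])

plain = colors.reindex(np.arange(6))                   # gaps at 1, 3, 5 stay missing
filled = colors.reindex(np.arange(6), method="ffill")  # gaps take the previous value

print(plain.isna().tolist())
# [False, True, False, True, False, True]
print(filled.tolist())
# ['blue', 'blue', 'purple', 'purple', 'yellow', 'yellow']
```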

I bet you saw this coming: with a DataFrame, `reindex` can alter the rows, the columns, or both.

In [None]:
frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=["a", "c", "d"],
                     columns=["Ohio", "Texas", "California"])
frame

When passed only a sequence, it reindexes the rows in the result:

In [None]:
frame2 = frame.reindex(index=["a", "b", "c", "d"])
frame2

Reindexing with the `columns` keyword:
- Notice how Utah doesn't have any values, so its column is filled with NaN. Ohio is dropped as well, since it isn't included in `states`.

In [None]:
frame

In [None]:
states = ["Texas", "Utah", "California"]
frame.reindex(columns=states)

Another way to reindex a particular axis is to pass the new axis labels as a positional argument and then specify the axis to reindex with the `axis` keyword:

In [None]:
frame.reindex(states, axis="columns")

You can also reindex by using the `loc` operator, and many users prefer to always do it this way. This works only if all of the new index labels already exist in the DataFrame (whereas `reindex` will insert missing data for new labels):

In [None]:
frame.loc[["a", "d", "c"], ["California", "Texas"]]
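The difference between the two approaches shows up as soon as a label is missing. A minimal sketch, rebuilding the same `frame` so it runs on its own (`reindexed` and `loc_failed` are illustrative names):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=["a", "c", "d"],
                     columns=["Ohio", "Texas", "California"])

# reindex: the unknown label "b" becomes a row of NaN
reindexed = frame.reindex(["a", "b"])
print(reindexed)

# loc: the same request raises, because "b" is not an existing label
loc_failed = False
try:
    frame.loc[["a", "b"]]
except KeyError as exc:
    loc_failed = True
    print(f"KeyError: {exc}")
```

So `loc` is the stricter choice: a typo in a label fails loudly instead of quietly producing a row of missing values.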

# Dropping Entries from an Axis
Dropping one or more entries from an axis is simple if you already have an index array or list without those entries, since you can use the `reindex` method or `.loc`-based indexing. As that can require a bit of munging and set logic, the `drop` method will return a new object with the indicated value or values deleted from an axis:


In [None]:
obj = pd.Series(np.arange(5.), index=["a", "b", "c", "d", "e"])
obj

Now without "c"

In [None]:
new_obj = obj.drop("c")
new_obj

You get the idea, removing "d" and "c"

In [None]:
obj.drop(["d", "c"])

With a DataFrame, index values can be deleted from either axis.

In [None]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])
data

Calling `drop` with a sequence of labels will drop values from the row labels (axis 0):

In [None]:
data.drop(index=["Colorado", "Ohio"])

To drop labels from the columns, instead use the `columns` keyword:

In [None]:
data.drop(columns=["two"])

You can also drop values from the columns by passing `axis=1` (which is like NumPy) or `axis="columns"`:

In [None]:
data.drop("two", axis=1)

In [None]:
data.drop(["two", "four"], axis="columns")

# Indexing, Selection, and Filtering
Series indexing (`obj[...]`) works analogously to NumPy array indexing, except you can use the Series’s index values instead of only integers. Here are some examples of this:

In [None]:
obj = pd.Series(np.arange(4.), index=["a", "b", "c", "d"])
obj

In [None]:
obj["b"]

In [None]:
obj[1]

In [None]:
obj[2:4]

In [None]:
obj[["b", "a", "d"]]

In [None]:
obj[[1, 3]]

In [None]:
obj[obj < 2]

While you can select data by label this way, the preferred way to select index values is with the special `loc` operator:

In [None]:
obj.loc[["b", "a", "d"]]

The reason to prefer `loc` is because of the different treatment of integers when indexing with `[]`. 
Regular `[]`-based indexing will treat integers as labels if the index contains integers, so the behavior differs depending on the data type of the index. For example:


In [None]:
obj1 = pd.Series([1, 2, 3], index=[2, 0, 1])
obj1

In [None]:
obj2 = pd.Series([1, 2, 3], index=["a", "b", "c"])
obj2

In [None]:
obj1[[0, 1, 2]]

In [None]:
obj2[[0, 1, 2]]

When using `loc`, the expression `obj.loc[[0, 1, 2]]` will fail when the index does not contain integers:

In [None]:
obj2.loc[[0, 1]]

Since `loc` operator indexes exclusively with labels, there is also an `iloc` operator that indexes exclusively with integers to work consistently whether or not the index contains integers:

In [None]:
obj1.iloc[[0, 1, 2]]

In [None]:
obj2.iloc[[0, 1, 2]]
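To make the label versus position distinction concrete, here is the same `obj1` again with the results written out (`by_label` and `by_position` are names I added):

```python
import pandas as pd

obj1 = pd.Series([1, 2, 3], index=[2, 0, 1])

by_label = obj1.loc[[0, 1, 2]]      # looks up the labels 0, 1, 2
by_position = obj1.iloc[[0, 1, 2]]  # takes the first, second, third entries

print(by_label.tolist())      # [2, 3, 1]
print(by_position.tolist())   # [1, 2, 3]
```

The same list `[0, 1, 2]` gives different answers because the index labels are integers in a shuffled order.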

- You can also slice with labels, but it works differently from normal Python slicing in that the endpoint is inclusive:



In [None]:
obj2.loc["b":"c"]

You can also assign values using this kind of label slice


In [None]:
obj2.loc["b":"c"] = 5
obj2

 - Do note that `loc` and `iloc` are NOT functions; they are indexed into with `[]`. This notation enables slice operations and allows indexing on multiple axes with DataFrame objects.
 
Indexing into a DataFrame retrieves one or more columns either with a single value or sequence:

In [None]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])
data

In [None]:
data["two"]

In [None]:
data[["three", "one"]]

Indexing like this has a few special cases. The first is slicing or selecting data with a Boolean array:

In [None]:
data[:2]

In [None]:
data[data["three"] > 5]

The row selection syntax `data[:2]` is provided as a convenience. Passing a single element or a list to the `[]` operator selects columns.

Another use case is indexing with a Boolean DataFrame, such as one produced by a scalar comparison. Consider a DataFrame with all Boolean values produced by comparing with a scalar value:

In [None]:
data < 5

We can use this DataFrame to assign the value 0 to each location with the value `True`, like so:

In [None]:
data[data < 5] = 0
data

### Selection on DataFrame with loc and iloc


Like Series, DataFrame has special attributes `loc` and `iloc` for label-based and integer-based indexing, respectively. Since DataFrame is two-dimensional, you can select a subset of the rows and columns with NumPy-like notation using either axis labels (`loc`) or integers (`iloc`).



In [None]:
data

In [None]:
data.loc["Colorado"]

Naturally we can do multiples as well

In [None]:
data.loc[["Colorado", "New York"]]

Can combine these row and column selections too. Behold:

In [None]:
data.loc["Colorado",["two", "three"]]

Now you can do the same with `iloc` as well, but you need to give it integers

In [None]:
data.iloc[2]

In [None]:
data.iloc[[2, 1]]

In [None]:
data.iloc[2,[3, 0, 1]]

In [None]:
data.iloc[[1, 2],[3, 0, 1]]

Both indexers work with slices in addition to single labels or lists of labels:

In [None]:
data.loc[:"Utah", "two"]

In [None]:
data.iloc[:, :3][data.three > 5]

A Boolean Series can be used with `loc` but not with `iloc`. `iloc` is strictly position-based, so it rejects a labelled Boolean Series (a plain Boolean NumPy array is accepted):

In [None]:
data.loc[data.three >= 2]

In [None]:
data.iloc[data.three >= 2]
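If you do want a Boolean mask with `iloc`, one workaround is to strip the labels off first with `to_numpy`. A sketch, rebuilding `data` so it runs standalone (the threshold 6 and the names `mask`, `via_loc`, `via_iloc` are just for illustration):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

mask = data["three"] >= 6                # Boolean Series, labelled by row

via_loc = data.loc[mask]                 # loc accepts the labelled mask
via_iloc = data.iloc[mask.to_numpy()]    # iloc accepts the bare Boolean array

try:
    data.iloc[mask]                      # the labelled mask itself is rejected
except (ValueError, NotImplementedError) as exc:
    print(f"{type(exc).__name__}: {exc}")

print(via_loc.equals(via_iloc))          # True
```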

### Integer indexing pitfalls


Working with pandas objects indexed by integers can be a stumbling block for new users since they work differently from built-in Python data structures like lists and tuples. For example, you might not expect the following code to generate an error:



In [None]:
ser = pd.Series(np.arange(3.))
ser

In [None]:
ser[-1]

In this case, pandas could “fall back” on integer indexing, but it is difficult to do this in general without introducing subtle bugs into the user code. Here we have an index containing `0`, `1`, and `2`, but pandas does not want to guess what the user wants (label-based indexing or position-based):

In [None]:
ser

On the other hand, with a noninteger index, there is no such ambiguity: 

In [None]:
ser2 = pd.Series(np.arange(3.), index=["a", "b", "c"])
ser2[-1]

If you have an axis index containing integers, data selection will always be label oriented. As I said above, if you use `loc` (for labels) or `iloc` (for integers) you will get exactly what you want:



In [None]:
ser.iloc[-1]

In [None]:
ser[:2]

As a result of these pitfalls, it is best to always prefer indexing with `loc` and `iloc` to avoid ambiguity.



## Pitfalls with chained indexing
In the previous section we looked at how you can do flexible selections on a DataFrame using `loc` and `iloc`. These indexing attributes can also be used to modify DataFrame objects in place, but doing so requires some care.

For example, in the example DataFrame above, we can assign to a column or row by label or integer position:



In [None]:
data.loc[:, "one"] = 1
data

Okay, so see how every value in the Utah row changes to 5?

In [None]:
data.iloc[2] = 5
data

Okay, this one is a little bit more complicated. Basically it is saying that:
 - For every row where column four is greater than five, set the entire row to 3

In [None]:
data.loc[data["four"] > 5] = 3
data

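Trying to chain selections when assigning does not work as intended. The first `loc` call returns a temporary copy, so the assignment never reaches `data` (depending on the pandas version, this also emits a warning). The second cell does it properly with a single `loc` call: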

In [None]:
data.loc[data.three == 5]["three"] = 6

In [None]:
data.loc[data.three == 5, "three"] = 6
data

Notice how the 5 in column "three" is now a 6

It's a good idea to avoid chained indexing when doing assignments.
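As a self-contained sketch of the safe pattern, here is a fresh copy of `data` where the row mask and the column label go into one `loc` call. The value 60 is arbitrary, picked so the change is easy to spot:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

# Row condition and column label in a single indexing step
data.loc[data["three"] == 6, "three"] = 60

print(data["three"].tolist())   # [2, 60, 10, 14]
```

Because there is only one indexing operation, pandas assigns straight into `data` and there is no intermediate object for the write to get lost in.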

## Arithmetic and Data Alignment
pandas can make it much simpler to work with objects that have different indexes. For example, when you add objects, if any index pairs are not the same, the respective index in the result will be the union of the index pairs. Let’s look at an example:

In [None]:
s1 = pd.Series([7.3, -2.5, 3.4, 1.5],
               index=["a", "c", "d", "e"])

s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1],
               index=["a", "c", "e", "f", "g"])

In [None]:
s1

In [None]:
s2

Adding them now

In [None]:
s1 + s2

For DataFrames, this alignment is performed on both rows and columns

In [None]:
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list("bcd"),
                   index=["Ohio", "Texas", "Colorado"])

df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list("bde"),
                   index=["Utah", "Ohio", "Texas", "Oregon"])


In [None]:
df1

In [None]:
df2

Now let's add them. The result's index and columns are the unions of the ones in each DataFrame:

In [None]:
df1 + df2

Since the `"c"` and `"e"` columns are not found in both DataFrame objects, they appear as missing in the result. The same holds for the rows with labels that are not common to both objects.

If you add DataFrame objects with no column or row labels in common, the result will contain all nulls:

In [None]:
df1 = pd.DataFrame({"A": [1, 2]})
df2 = pd.DataFrame({"B": [3, 4]})

In [None]:
df1

In [None]:
df2

In [None]:
df1 + df2

### Arithmetic methods with fill values
Hey remember fill values, well they remember you.

In arithmetic operations between differently indexed objects, you might want to fill with a special value, like 0, when an axis label is found in one object but not the other. Here is an example where we set a particular value to NA (null) by assigning `np.nan` to it:

In [None]:
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),
                   columns=list("abcd"))

df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),
                   columns=list("abcde"))

df2.loc[1, "b"] = np.nan

In [None]:
df1

In [None]:
df2

Here when we add them, column `e` and row `3` become `NaN` because they exist only in `df2`, and the `b` value in row `1` is `NaN` because we set it to `np.nan`:

In [None]:
df1 + df2

Using the `add` method on df1, I pass `df2` and an argument to `fill_value`, which substitutes the passed value for any missing values in the operation:

In [None]:
df1.add(df2, fill_value=0)
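A tiny sketch makes it easier to see exactly which cells `fill_value` rescues. The frames `left` and `right` are made up for this:

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"a": [1.0, 2.0]})
right = pd.DataFrame({"a": [10.0, np.nan], "b": [5.0, 6.0]})

plain = left + right                     # any missing operand gives NaN
filled = left.add(right, fill_value=0)   # a missing operand is treated as 0

print(plain)
print(filled)
```

In `plain`, column `b` is all NaN (it is absent from `left`) and `a` is NaN in row 1. In `filled`, those cells become `0 + 5`, `0 + 6`, and `2 + 0`. A cell that is missing on both sides would still come out as NaN.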

See Table 5.5 for a listing of Series and DataFrame methods for arithmetic. Each has a counterpart, starting with the letter `r`, that has arguments reversed. So these two statements are equivalent:



In [None]:
1 / df1

In [None]:
df1.rdiv(1)

Relatedly, when reindexing a Series or DataFrame, you can also specify a different fill value:

In [None]:
df1.reindex(columns=df2.columns, fill_value=0)

#### Operations between DataFrame and Series
That's right, we can also do math between a DataFrame and a Series. First, consider the difference between a two-dimensional array and one of its rows:

In [None]:
arr = np.arange(12.).reshape((3, 4))
arr

In [None]:
arr[0]

In [None]:
arr - arr[0]

- The subtraction here is performed once for each row. This is called `broadcasting`.

Operations between DataFrames and Series are similar to this

In [None]:
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])

series = frame.iloc[0]

In [None]:
frame

In [None]:
series

By default, arithmetic between DataFrame and Series matches the index of the Series on the columns of the DataFrame, broadcasting down the rows:

In [None]:
frame - series


If an index value is not found in either the DataFrame’s columns or the Series’s index, the objects will be reindexed to form the union:

In [None]:
series2 = pd.Series(np.arange(3), index=["b", "e", "f"])
series2

In [None]:
frame + series2

If you want to instead broadcast over the columns, matching on the rows, you have to use one of the arithmetic methods and specify to match over the index. For example:

In [None]:
series3 = frame["d"]
frame


In [None]:
series3

In [None]:
frame.sub(series3, axis="index")
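Here are the two broadcasting directions side by side on a smaller, made-up frame, so the numbers are easy to check by hand (`by_row` and `by_col` are names I added):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(6.).reshape((2, 3)),
                     columns=list("bde"),
                     index=["Utah", "Ohio"])
# frame is:  Utah -> 0, 1, 2   and   Ohio -> 3, 4, 5

# Default: match the Series on the columns, repeat down the rows
by_row = frame - frame.iloc[0]

# axis="index": match the Series on the row labels, repeat across the columns
by_col = frame.sub(frame["d"], axis="index")

print(by_row.to_numpy().tolist())   # [[0.0, 0.0, 0.0], [3.0, 3.0, 3.0]]
print(by_col.to_numpy().tolist())   # [[-1.0, 0.0, 1.0], [-1.0, 0.0, 1.0]]
```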

## Function Application and Mapping
NumPy ufuncs (element-wise array methods) also work with pandas objects:

In [None]:
frame = pd.DataFrame(np.random.standard_normal((4, 3)),
                     columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])

frame

Using that np variable we imported all those lines ago.

In [None]:
np.abs(frame)

Another frequent operation is applying a function on one-dimensional arrays to each column or row. DataFrame’s `apply` method does exactly this:

In [None]:
def f1(x):
    return x.max() - x.min()

frame.apply(f1)


Here the function `f1`, which computes the difference between the maximum and minimum of a Series, is invoked once on each column in `frame`. The result is a Series having the columns of `frame` as its index.

If you pass `axis="columns"` to `apply`, the function will be invoked once per row instead. A helpful way to think about this is as "apply across the columns":


In [None]:
frame.apply(f1, axis="columns")

Many of the most common array statistics (like `sum` and `mean`) are DataFrame methods, so using `apply` is not necessary.

The function passed to `apply` need not return a scalar value; it can also return a Series with multiple values:

In [None]:
def f2(x):
    return pd.Series([x.min(), x.max()], index=["min", "max"])

frame.apply(f2)

Element-wise Python functions can be used, too. Suppose you wanted to compute a formatted string from each floating-point value in `frame`. You can do this with `applymap` (deprecated since pandas 2.1 in favor of the equivalent `DataFrame.map`):

In [None]:
def my_format(x):
    return f"{x:.2f}"

frame.applymap(my_format)

The reason for the name `applymap` is that Series has a `map` method for applying an element-wise function:

In [None]:
frame["e"].map(my_format)
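Since the element-wise DataFrame method has two names depending on the pandas version, a sketch that works with either might look like this. The `hasattr` check is my own workaround, not something from the book, and the data is made up:

```python
import pandas as pd

frame = pd.DataFrame({"b": [1.234, -0.5], "e": [2.0, 3.14159]})

def my_format(x):
    return f"{x:.2f}"

# DataFrame.map exists in pandas 2.1+; older versions only have applymap
elementwise = frame.map if hasattr(pd.DataFrame, "map") else frame.applymap
formatted = elementwise(my_format)

print(formatted)
print(frame["e"].map(my_format).tolist())   # Series.map works in all versions
```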

## Sorting and Ranking
Sorting a dataset by some criterion is another important built-in operation. To sort lexicographically by row or column label, use the `sort_index` method, which returns a new, sorted object:

In [None]:
obj = pd.Series(np.arange(4), index=["d", "a", "b", "c"])
obj

In [None]:
obj.sort_index()

With a DataFrame, you can sort by index on either axis:

In [None]:
frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                     index=["three", "one"],
                     columns=["d", "a", "b", "c"])

frame

In [None]:
frame.sort_index()


In [None]:
frame.sort_index(axis="columns")

The data can be sorted in descending order too:

In [None]:
frame.sort_index(axis="columns", ascending=False)

You can even sort a Series by its values with the power of the `sort_values` method

In [None]:
obj = pd.Series([4, 7, -3, 2])
obj.sort_values()

If values are missing then they get sorted onto the end by default.

In [None]:
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
obj.sort_values()

You know it's coming: we've got an option called `na_position` that sticks them at the front instead.

In [None]:
obj.sort_values(na_position="first")

When sorting a DataFrame, you can use the data in one or more columns as the sort keys, by passing those column names to `sort_values`:

In [None]:
frame = pd.DataFrame({"b": [4, 7, -3, 2], "a": [0, 1, 0, 1]})
frame

Check it out, all the rows were sorted based on column `b`

In [None]:
frame.sort_values("b")

And of course we can do multiple columns.

Here you can see how it goes to the next column if there is a duplicate in the first.

In [None]:
frame.sort_values(["a", "b"])

Ranking assigns ranks from one through the number of valid data points in an array, starting from the lowest value. The `rank` methods for Series and DataFrame are the place to look; by default, `rank` breaks ties by assigning each group the mean rank:

 - Basically, tied values share the average of the ranks they would otherwise occupy. The two 7s would take ranks 6 and 7, so each gets (6 + 7) / 2 = 6.5; the two 4s would take ranks 4 and 5, so each gets 4.5.

In [None]:
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
obj.rank()

Ranks can also be assigned according to the order in which they’re observed in the data:

In [None]:
obj.rank(method="first")

Here, instead of using the average rank 6.5 for the entries 0 and 2, they instead have been set to 6 and 7 because label 0 precedes label 2 in the data.

You can rank in descending order, too:

In [None]:
obj.rank(ascending=False)
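None of the rank cells above were run, so here are the three variants with their results worked out for the same `obj` (the variable names are mine):

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

average = obj.rank()                    # ties share the mean rank
first = obj.rank(method="first")        # ties broken by order of appearance
descending = obj.rank(ascending=False)  # largest value gets rank 1

print(average.tolist())      # [6.5, 1.0, 6.5, 4.5, 3.0, 2.0, 4.5]
print(first.tolist())        # [6.0, 1.0, 7.0, 4.0, 3.0, 2.0, 5.0]
print(descending.tolist())   # [1.5, 7.0, 1.5, 3.5, 5.0, 6.0, 3.5]
```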

DataFrame can compute ranks over the rows or the columns:

In [None]:
frame = pd.DataFrame({"b": [4.3, 7, -3, 2], "a": [0, 1, 0, 1],
                      "c": [-2, 5, 8, -2.5]})

frame

In [None]:
frame.rank(axis="columns")

## Axis Indexes with Duplicate Labels
Up until now almost all of the examples we have looked at have unique axis labels (index values). While many pandas functions (like `reindex`) require that the labels be unique, it’s not mandatory. Let’s consider a small Series with duplicate indices:

In [None]:
obj = pd.Series(np.arange(5), index=["a", "a", "b", "b", "c"])
obj

The `is_unique` property of the index can tell you whether or not its labels are unique:

In [None]:
obj.index.is_unique

Data selection is one of the main things that behaves differently with duplicates. Indexing a label with multiple entries returns a Series, while single entries return a scalar value:

In [None]:
obj["a"]

In [None]:
obj["c"]

This can make your code more complicated, as the output type of indexing can vary depending on whether or not a label is repeated.

The same logic extends to indexing rows (or columns) in a DataFrame:

In [None]:
df = pd.DataFrame(np.random.standard_normal((5, 3)),
                  index=["a", "a", "b", "b", "c"])
df


In [None]:
df.loc["b"]


In [None]:
df.loc["c"]

## Summarizing and Computing Descriptive Statistics
pandas objects are equipped with a set of common mathematical and statistical methods. Most of these fall into the category of `reductions` or `summary statistics`, methods that extract a single value (like the sum or mean) from a Series, or a Series of values from the rows or columns of a DataFrame. Compared with the similar methods found on NumPy arrays, they have built-in handling for missing data. Consider a small DataFrame:



In [None]:
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=["a", "b", "c", "d"],
                  columns=["one", "two"])

df

Calling DataFrame's `sum` method returns a Series containing the column sums:

In [None]:
df.sum()

Okay, but what if we want to `sum` across the columns? Yeah, DataFrame has us covered there too with `axis="columns"` or `axis=1`

In [None]:
df.sum(axis="columns")

As seen with row `c`, if the entire row or column is `NaN`, the sum is `0`.

- If at least one value is not `NaN`, the `NaN` values are skipped and the rest are summed.

To stop skipping `NaN` values, pass `skipna=False`; then a single `NaN` in a row or column makes its result `NaN`:

In [None]:
df.sum(axis="index", skipna=False)

In [None]:
df.sum(axis="columns", skipna=False)
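Here is the same `df` with the row sums compared directly. The `min_count=1` variant is an extra I'm adding for the all-`NaN` case, and the names `default`, `strict`, and `at_least_one` are mine:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=["a", "b", "c", "d"],
                  columns=["one", "two"])

default = df.sum(axis="columns")                      # NaN skipped; all-NaN row gives 0.0
strict = df.sum(axis="columns", skipna=False)         # any NaN in a row gives NaN
at_least_one = df.sum(axis="columns", min_count=1)    # all-NaN row gives NaN instead of 0.0

print(default)
print(strict)
print(at_least_one)
```

Row `a` is the interesting one: it sums to 1.4 by default but is `NaN` with `skipna=False`. Row `c` is 0.0 by default and `NaN` in the other two.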

Some aggregations, like `mean`, require at least one non-NA value to yield a value result, so here we have:

In [None]:
df.mean(axis="columns")

Some methods, like `idxmin` and `idxmax`, return indirect statistics, like the index value where the minimum or maximum values are attained:

In [None]:
df.idxmax()

Other methods are `accumulations`:



In [None]:
df.cumsum()

Some methods are neither reductions nor accumulations. `describe` is one such example, producing multiple summary statistics in one shot:



In [None]:
df.describe()


On nonnumeric data, `describe` produces alternative summary statistics:



In [None]:
obj = pd.Series(["a", "a", "b", "c"] * 4)
obj.describe()


### Correlation and Covariance
Some summary statistics, like correlation and covariance, are computed from pairs of arguments. Let’s consider some DataFrames of stock prices and volumes originally obtained from Yahoo! Finance and available in binary Python pickle files you can find in the accompanying datasets for the book:

In [None]:
price = pd.read_pickle("examples/yahoo_price.pkl")
volume = pd.read_pickle("examples/yahoo_volume.pkl")


Let's compute percent changes of the prices:

In [None]:
returns = price.pct_change()
returns.tail()

The `corr` method of Series computes the correlation of the overlapping, non-NA, aligned-by-index values in two Series. Relatedly, `cov` computes the covariance:

In [None]:
returns["MSFT"].corr(returns["IBM"])

In [None]:
returns["MSFT"].cov(returns["IBM"])

Since `MSFT` is a valid Python variable name, we can also select these columns using more concise syntax:

In [None]:
returns.MSFT.corr(returns.IBM)
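To see how `corr` and `cov` relate, here is a minimal sketch on synthetic data. The random numbers below are a made-up stand-in for the Yahoo! Finance returns, not the real dataset. It checks that the Pearson correlation is just the covariance divided by the product of the two standard deviations, computed over the rows where both values are present.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for `returns` (hypothetical values, not the book's data)
rng = np.random.default_rng(0)
returns = pd.DataFrame(rng.normal(0, 0.01, size=(100, 2)),
                       columns=["MSFT", "IBM"])
returns.iloc[0] = np.nan  # mimic the leading NaN row that pct_change produces

corr = returns["MSFT"].corr(returns["IBM"])
cov = returns["MSFT"].cov(returns["IBM"])

# Both methods drop the NaN row first, so do the same before taking std
valid = returns.dropna()
manual_corr = cov / (valid["MSFT"].std() * valid["IBM"].std())

print(corr, manual_corr)
```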

DataFrame’s `corr` and `cov` methods, on the other hand, return a full correlation or covariance matrix as a DataFrame, respectively:

In [None]:
returns.corr()

In [None]:
returns.cov()

Using DataFrame’s `corrwith` method, you can compute pair-wise correlations between a DataFrame’s columns or rows with another Series or DataFrame. Passing a Series returns a Series with the correlation value computed for each column:

In [None]:
returns.corrwith(returns["IBM"])

Passing a DataFrame computes the correlations of matching column names. Here, I compute correlations of percent changes with volume:

In [None]:
returns.corrwith(volume)

Passing `axis="columns"` does things row-by-row instead. In all cases, the data points are aligned by label before the correlation is computed.
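Here is a rough sketch of both directions of `corrwith`, again on made-up data (the tickers and random values are placeholders). The columns of the second frame are deliberately in a different order to show that matching happens by label, not by position.

```python
import numpy as np
import pandas as pd

# Hypothetical data: same labels in both frames, different column order
rng = np.random.default_rng(1)
idx = pd.date_range("2020-01-01", periods=50)
returns = pd.DataFrame(rng.normal(size=(50, 3)), index=idx,
                       columns=["AAPL", "IBM", "MSFT"])
volume = pd.DataFrame(rng.normal(size=(50, 3)), index=idx,
                      columns=["MSFT", "IBM", "AAPL"])

# Default: one correlation per matching column name
by_column = returns.corrwith(volume)

# axis="columns": one correlation per row, taken across the matching columns
by_row = returns.corrwith(volume, axis="columns")

print(by_column)
print(by_row.head())
```

With only three columns, each row-wise correlation is based on three points, so treat those numbers as a demonstration of the mechanics rather than anything meaningful.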

## Unique Values, Value Counts, and Membership
Another class of related methods extracts information about the values contained in a one-dimensional Series. To illustrate these, consider this example:

In [513]:
obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

The first function is `unique`, which gives you an array of the unique values in a Series:



In [514]:
uniques = obj.unique()
uniques

array(['c', 'a', 'd', 'b'], dtype=object)

The unique values are returned in the order in which they first appear, not in sorted order, but they can be sorted after the fact if needed (`uniques.sort()`). Relatedly, `value_counts` computes a Series containing value frequencies:

In [515]:
obj.value_counts()

c    3
a    3
b    2
d    1
dtype: int64
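Two variations that I find handy, sketched on the same `obj`: `normalize=True` returns proportions instead of raw counts, and chaining `sort_index()` orders the result by label rather than by frequency.

```python
import pandas as pd

obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

# Proportions instead of counts; the values add up to 1
shares = obj.value_counts(normalize=True)

# Order by label instead of by frequency
by_label = obj.value_counts().sort_index()

print(shares)
print(by_label)
```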

The Series is sorted by value in descending order as a convenience. `value_counts` is also available as a top-level pandas function that can be used with NumPy arrays or other Python sequences:



In [517]:
pd.value_counts(obj.to_numpy(), sort=False)

c    3
a    3
d    1
b    2
dtype: int64
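As far as I can tell, newer pandas releases (2.1 and later, I believe) emit a deprecation warning for the top-level `pd.value_counts`. A sketch of the equivalent that should work on both older and newer versions is to wrap the array in a Series first:

```python
import pandas as pd

obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

# Wrap the raw NumPy array in a Series, then use the Series method
arr = obj.to_numpy()
counts = pd.Series(arr).value_counts(sort=False)

print(counts)
```

The row order with `sort=False` may differ between pandas versions, so it is safer not to rely on it.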

`isin` performs a vectorized set membership check and can be useful in filtering a dataset down to a subset of values in a Series or column in a DataFrame:



In [518]:
obj

0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object

In [520]:
mask = obj.isin(["b","c"])
mask

0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool

And we can use the mask to select just the matching values:

In [521]:
obj[mask]

0    c
5    b
6    b
7    c
8    c
dtype: object
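The same idea carries over to filtering DataFrame rows by one column. The `frame` below, with its `key` and `value` columns, is a made-up example rather than something from the notes above; `~` inverts the mask to keep everything that is not in the list.

```python
import pandas as pd

# Hypothetical table for illustration
frame = pd.DataFrame({"key": ["c", "a", "d", "a", "b"],
                      "value": [10, 20, 30, 40, 50]})

wanted = ["b", "c"]

# Keep the rows whose key is in the list
subset = frame[frame["key"].isin(wanted)]

# Invert the mask to keep the rows whose key is NOT in the list
complement = frame[~frame["key"].isin(wanted)]

print(subset)
print(complement)
```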

Related to `isin` is the `Index.get_indexer` method, which gives you an index array from an array of possibly nondistinct values into another array of distinct values:

In [522]:
to_match = pd.Series(["c", "a", "b", "b", "c", "a"])
unique_vals = pd.Series(["c", "b", "a"])
indices = pd.Index(unique_vals).get_indexer(to_match)

indices

array([0, 2, 1, 1, 0, 2])
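One detail the example above does not show: values with no match come back as `-1`. A small sketch, where `"z"` is a deliberately missing value I added for illustration:

```python
import pandas as pd

unique_vals = pd.Index(["c", "b", "a"])
to_match = ["c", "a", "z", "b"]  # "z" does not occur in unique_vals

# Position of each value in unique_vals; -1 marks "not found"
positions = unique_vals.get_indexer(to_match)

# Filter out the misses before using the positions to index
found = positions >= 0
matched = unique_vals[positions[found]]

print(positions)
print(matched)
```

Filtering on `positions >= 0` matters because using `-1` directly as a position would silently pick the last element.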

Ya like making histograms on multiple related columns in a DataFrame? Here is how:

In [523]:
data = pd.DataFrame({"Qu1": [1, 3, 4, 3, 4],
                     "Qu2": [2, 3, 1, 2, 3],
                     "Qu3": [1, 5, 2, 4, 4]})

data

   Qu1  Qu2  Qu3
0    1    2    1
1    3    3    5
2    4    1    2
3    3    2    4
4    4    3    4


We can compute the value counts for a single column, like so:

In [524]:
data["Qu1"].value_counts().sort_index()

1    1
3    2
4    2
Name: Qu1, dtype: int64

To compute this for all columns, pass `pandas.value_counts` to the DataFrame’s `apply` method:

In [525]:
result = data.apply(pd.value_counts).fillna(0)
result

   Qu1  Qu2  Qu3
1  1.0  1.0  1.0
2  0.0  2.0  1.0
3  2.0  2.0  0.0
4  2.0  0.0  2.0
5  0.0  0.0  1.0


Here, the row labels in the result are the distinct values occurring in all of the columns. The values are the respective counts of these values in each column.
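The counts come out as floats because the missing combinations are `NaN` before `fillna(0)` runs. A possible variation, sketched below, casts back to integers. I'm also swapping in `pd.Series.value_counts` for the top-level `pd.value_counts`, on the assumption that the top-level function is deprecated in newer pandas.

```python
import pandas as pd

data = pd.DataFrame({"Qu1": [1, 3, 4, 3, 4],
                     "Qu2": [2, 3, 1, 2, 3],
                     "Qu3": [1, 5, 2, 4, 4]})

# Count values per column, fill the gaps with 0, then restore integer counts
result = (data.apply(pd.Series.value_counts)
              .fillna(0)
              .astype(int))

print(result)
```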

There is also a `DataFrame.value_counts` method, but it computes counts considering each row of the DataFrame as a tuple to determine the number of occurrences of each distinct row:



In [526]:
data = pd.DataFrame({"a": [1, 1, 1, 2, 2], "b": [0, 0, 1, 0, 0]})
data

   a  b
0  1  0
1  1  0
2  1  1
3  2  0
4  2  0


In [527]:
data.value_counts()

a  b
1  0    2
2  0    2
1  1    1
dtype: int64
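The result is a Series indexed by the `(a, b)` combinations, so a single count can be looked up with a tuple. If a flat table is easier to work with, `reset_index` turns the index levels back into columns; the column name `"n"` below is just my choice for this sketch.

```python
import pandas as pd

data = pd.DataFrame({"a": [1, 1, 1, 2, 2], "b": [0, 0, 1, 0, 0]})

counts = data.value_counts()

# Look up how often the row (a=1, b=0) occurs
n_one_zero = counts.loc[(1, 0)]

# Flatten into a regular DataFrame with the count in a column named "n"
tidy = counts.reset_index(name="n")

print(n_one_zero)
print(tidy)
```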