In [2]:
import numpy as np
import pandas as pd

In [3]:
df = pd.DataFrame({
    'one' : pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
    'two' : pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
    'three' : pd.Series(np.random.randn(3), index=['b', 'c', 'd'])
})

print(df)

        one       two     three
a  0.146475 -0.363166       NaN
b -0.001427 -0.054496 -1.033983
c -1.113632  0.221810  2.687847
d       NaN -0.081181  0.919779


# Boolean Operations

## Boolean Comparisons

- eq (== equal to)
- ne (!= not equal to)
- le (<= less than or equal to)
- lt (< less than)
- ge (>= greater than or equal to)
- gt (> greater than)
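Each method is the flexible counterpart of the operator shown in parentheses. As a minimal sketch (the frame `demo` and the Series `thresholds` are made up for illustration), the two forms agree for a scalar operand, and the method form also accepts an `axis` argument for aligning a Series against the rows:

```python
import numpy as np
import pandas as pd

# Hypothetical frame used only for this illustration.
demo = pd.DataFrame({"x": [1.0, 2.0, np.nan], "y": [3.0, 0.5, 1.0]})

# Method form and operator form are equivalent for a scalar operand.
same = demo.gt(1).equals(demo > 1)

# The method form can align a Series along the rows (axis=0).
thresholds = pd.Series([0.0, 1.0, 2.0])  # one threshold per row
above = demo.gt(thresholds, axis=0)

print(same)
print(above)
```

Comparisons involving NaN evaluate to False, which is why the last row of `above` contains no True in column `x`.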

In [4]:
# create a copy of the dataframe
df2 = df.copy()

# compare the frame with its copy; NaN never compares equal, even to itself
print(df.gt(df2), '\n')
print(df.ne(df2), '\n')
print(np.nan == np.nan)

     one    two  three
a  False  False  False
b  False  False  False
c  False  False  False
d  False  False  False 

     one    two  three
a  False  False   True
b  False  False  False
c  False  False  False
d   True  False  False 

False


## Reductions to Summarize Boolean Results

The following reductions summarize a boolean result. `any()` and `all()` reduce per column by default:
- empty (a property, not a method)
- any()
- all()
- bool() (single-element objects only)

In [5]:
print((df > 0).all(), '\n')
print((df > 0).any(), '\n')
print((df > 0).any().any())

one      False
two      False
three    False
dtype: bool 

one      True
two      True
three    True
dtype: bool 

True
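The same reductions can run across the columns instead. The sketch below uses a made-up frame `demo` to show `axis=1`, which yields one boolean per row, and the `empty` property:

```python
import numpy as np
import pandas as pd

# Hypothetical frame used only for this illustration.
demo = pd.DataFrame({"x": [1.0, -2.0, np.nan], "y": [3.0, 0.5, -1.0]})

# axis=1 reduces across the columns, giving one boolean per row.
row_all = (demo > 0).all(axis=1)
row_any = (demo > 0).any(axis=1)

# `empty` is a property (no parentheses): True only when there are no elements.
print(demo.empty)
print(pd.DataFrame(columns=["x", "y"]).empty)
print(row_all.tolist(), row_any.tolist())
```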


## Single-Element Pandas Objects in a Boolean Context

A Pandas Series or DataFrame that holds exactly one boolean element can be evaluated with `bool()`, which returns that element as a single `True` or `False`.

In [6]:
print(pd.Series([True]).bool(), '\n')
print(pd.Series([False]).bool(), '\n')
print(pd.DataFrame([True]).bool(), '\n')
print(pd.DataFrame([True]).bool(), '\n')

True 

False 

True 

True 
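With more than one element the truth value is ambiguous, and pandas raises a `ValueError` rather than guessing. A minimal sketch with a made-up Series `flags` follows. It uses `.item()` to extract a single value, on the assumption that recent pandas releases deprecate `.bool()`:

```python
import pandas as pd

flags = pd.Series([True, False])  # hypothetical two-element Series

try:
    if flags:  # ambiguous: should this mean any() or all()?
        pass
    ambiguous = False
except ValueError:
    ambiguous = True

print(ambiguous)                 # the multi-element truth test was rejected
print(flags.any())               # explicit reduction: at least one True
print(flags.all())               # explicit reduction: every element True
print(pd.Series([True]).item())  # single element extracted as a scalar
```

Stating the intended reduction with `any()` or `all()` keeps the condition unambiguous.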



## Objects Comparison

Pandas performs element-wise comparisons when a pandas data structure is compared with a scalar value or with another array-like object of the same length.

In [7]:
print(pd.Series(['foo', 'bar', 'baz']) == 'foo', '\n')
print(pd.Series(['foo', 'bar', 'baz']) == pd.Index(['foo', 'bar', 'qux']))

0     True
1    False
2    False
dtype: bool 

0     True
1     True
2    False
dtype: bool
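The "same length" condition matters. As a hedged sketch (the names `left` and `right` are illustrative), comparing two Series of different lengths raises a `ValueError` instead of broadcasting:

```python
import pandas as pd

left = pd.Series(["foo", "bar", "baz"])
right = pd.Series(["foo", "bar"])  # deliberately shorter

try:
    left == right
    raised = False
except ValueError:
    raised = True

print(raised)

# Comparing only the overlapping part works once the lengths agree.
print((left.iloc[:2] == right).tolist())
```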


In [8]:
print((df + df == df * 2).all(), '\n')
print((df + df).equals(df * 2))  # equals() treats NaNs in matching positions as equal, unlike ==

one      False
two       True
three    False
dtype: bool 

True
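The difference comes down to how NaN is handled. The sketch below, using two made-up Series `a` and `b`, contrasts the element-wise test with `equals()` and also shows that `equals()` expects the index to be in the same order:

```python
import numpy as np
import pandas as pd

a = pd.Series([1.0, np.nan, 3.0])
b = pd.Series([1.0, np.nan, 3.0])

elementwise = (a == b).all()  # NaN == NaN is False, so this is False
whole = a.equals(b)           # NaNs in matching positions count as equal

# equals() also requires the index to be in the same order.
shuffled = b.iloc[::-1]
print(elementwise, whole)
print(a.equals(shuffled), a.equals(shuffled.sort_index()))
```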


# Descriptive Statistics

Pandas provides a large number of methods for computing descriptive statistics and related operations on Series and DataFrame objects. All of them are vectorized, and most are aggregations that produce a lower-dimensional result.

Generally speaking, these methods take an `axis` argument, which can be specified by name (`'index'` or `'columns'`) or by integer (0 or 1).

![Descriptive Statistics in Pandas](https://i.imgur.com/ZDOek2T.png)

In [9]:
# aggregation for each column
print(df.mean(0), '\n')

# aggregation for each row
print(df.mean(1))

one     -0.322861
two     -0.069258
three    0.857881
dtype: float64 

a   -0.108345
b   -0.363302
c    0.598675
d    0.419299
dtype: float64
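The axis can also be given by name. The sketch below uses a made-up frame `demo` to show that the two spellings agree, and that missing values are skipped unless `skipna=False` is passed:

```python
import numpy as np
import pandas as pd

# Hypothetical frame used only for this illustration.
demo = pd.DataFrame({"x": [1.0, 2.0, np.nan], "y": [3.0, 5.0, 7.0]})

by_int = demo.mean(axis=0)
by_name = demo.mean(axis="index")    # same as axis=0: one value per column
per_row = demo.mean(axis="columns")  # same as axis=1: one value per row

# Missing values are skipped by default; skipna=False propagates them.
strict = demo.mean(axis=0, skipna=False)

print(by_int.equals(by_name))
print(per_row.tolist())
print(strict)
```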


By applying vectorized operations, we can express various statistical procedures concisely, such as *standardization* (rescaling the data to zero mean and a standard deviation of 1).

In [10]:
ts_stand = (df - df.mean()) / df.std()
ts_stand.std()

one      1.0
two      1.0
three    1.0
dtype: float64
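Standardizing each row rather than each column needs explicit alignment, because the per-row statistics are indexed by row label. A minimal sketch, assuming a made-up frame `demo` and an arbitrary seed:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
demo = pd.DataFrame(rng.standard_normal((4, 3)), columns=["one", "two", "three"])

# Subtract each row's mean and divide by each row's standard deviation.
# axis=0 aligns the per-row Series with the frame's row labels.
xs_stand = demo.sub(demo.mean(axis=1), axis=0).div(demo.std(axis=1), axis=0)

print(xs_stand.mean(axis=1))
print(xs_stand.std(axis=1))
```

Writing `demo - demo.mean(axis=1)` would instead try to match the row labels against the column names, which is why `sub` and `div` with `axis=0` are used here.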

## Describe

There is a convenient `describe()` function which computes a variety of summary statistics about a `Series` or the columns of a `DataFrame`:

In [11]:
series = pd.Series(np.random.randn(1000))
series[::2] = np.nan
series.describe()

count    500.000000
mean      -0.002538
std        0.957866
min       -2.708319
25%       -0.650090
50%        0.001456
75%        0.644395
max        2.579643
dtype: float64
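The percentiles in the summary can be chosen explicitly through the `percentiles` argument. A short sketch on a made-up Series `values` (the seed is arbitrary):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed
values = pd.Series(rng.standard_normal(1000))

# Request the 5th, 50th and 95th percentiles instead of the default quartiles.
summary = values.describe(percentiles=[0.05, 0.5, 0.95])
print(summary.index.tolist())
```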

In [13]:
frame = pd.DataFrame(
    np.random.randn(1000, 5),
    columns=['a', 'b', 'c', 'd', 'e'])
frame.iloc[::2] = np.nan
frame.describe()

                a           b           c           d           e
count  500.000000  500.000000  500.000000  500.000000  500.000000
mean     0.008352    0.000572   -0.033490    0.010853   -0.012399
std      1.064142    1.011945    1.003456    1.005942    0.972037
min     -4.111090   -3.615698   -3.096239   -2.739371   -2.471437
25%     -0.678935   -0.646975   -0.668141   -0.655901   -0.700984
50%      0.025971    0.047380   -0.062987    0.011205   -0.049313
75%      0.738701    0.639458    0.644879    0.620590    0.660620
max      3.185961    3.027164    4.054239    3.270990    2.973112


The `count` row gives the number of non-missing values. Although the random function generated 1000 numbers per column, half of them were replaced with NaN by the slice assignment indicated on line 4 of the cell.
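The relationship between the total length, `count`, and the number of missing values can be checked directly. This sketch rebuilds a frame the same way as the cell above:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.random.randn(1000, 5), columns=["a", "b", "c", "d", "e"])
frame.iloc[::2] = np.nan  # blank out every other row, as in the cell above

print(len(frame))               # total rows, missing or not
print(frame["a"].count())       # non-missing values only
print(frame["a"].isna().sum())  # missing values
```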

For a non-numerical Series object, `describe()` gives a simple summary of the number of unique values and the most frequently occurring value:

In [14]:
s = pd.Series(['a', 'a', 'b', 'b', 'a', 'a', np.nan, 'c', 'd', 'a'])
s.describe()

count     9
unique    4
top       a
freq      5
dtype: object
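Each of the four entries can be reproduced with a more specific call, which helps when only one of them is needed. A minimal sketch on the same data:

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "a", "b", "b", "a", "a", np.nan, "c", "d", "a"])

counts = s.value_counts()  # NaN is excluded by default
print(s.count())           # count:  non-missing entries
print(s.nunique())         # unique: distinct non-missing values
print(counts.idxmax())     # top:    most frequent value
print(counts.max())        # freq:   how often it occurs
```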

## Index of Min/Max Values

The `idxmin()` and `idxmax()` functions on `Series` and `DataFrame` return the index labels of the minimum and maximum values:

In [16]:
s1 = pd.Series(np.random.randn(5))
s1

0    0.664493
1    0.672600
2    2.446050
3    0.555100
4   -1.341844
dtype: float64

In [17]:
s1.idxmin(), s1.idxmax()

(4, 2)

In [18]:
df1 = pd.DataFrame(np.random.randn(5, 3), columns=['A', 'B', 'C'])
df1

          A         B         C
0 -0.140097  0.123954 -1.357897
1 -0.309110  1.855777 -0.596847
2 -2.296176 -1.343464 -0.911371
3  0.615740 -0.387936 -0.712033
4 -1.043531  0.443584  0.307695


In [19]:
df1.idxmin(axis=0)

A    2
B    2
C    0
dtype: int64

In [20]:
df1.idxmax(axis=1)

0    B
1    B
2    C
3    A
4    B
dtype: object
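When several entries share the minimum or maximum, `idxmin()` and `idxmax()` return the first matching label. The sketch below uses a made-up frame `df3` with a tie for the minimum and a missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a tie for the minimum (labels "d" and "c").
df3 = pd.DataFrame([2, 1, 1, 3, np.nan], columns=["A"], index=list("edcba"))

print(df3["A"].idxmin())  # first label holding the minimum
print(df3["A"].idxmax())  # NaN is skipped by default
```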

## Iter