In [64]:
import numpy as np
import pandas as pd

# Pandas Basics II

We will continue practicing basic operations in Pandas. To extend what we've already learned, we'll cover:

    Boolean comparisons
    Objects comparison
    Descriptive statistics
    Iterations


## Boolean Comparisons

Series and DataFrame have the binary comparison methods eq, ne, le, lt, ge, and gt whose behavior is vectorized:

    eq (equivalent to ==): equal to
    ne (equivalent to !=): not equal to
    le (equivalent to <=): less than or equal to
    lt (equivalent to <): less than
    ge (equivalent to >=): greater than or equal to
    gt (equivalent to >): greater than
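
As a minimal sketch of the equivalence listed above, the following compares the method form and the operator form on a small frame. The names `scores` and `limits` and their values are made up for illustration.

```python
import pandas as pd

# Hypothetical data, used only to illustrate the method/operator equivalence
scores = pd.DataFrame({'x': [1, 5, 3], 'y': [4, 2, 6]})

# The method form and the operator form give the same boolean DataFrame
by_method = scores.gt(2)
by_operator = scores > 2
same_result = by_method.equals(by_operator)

# The method form also accepts an axis, e.g. to compare every row
# against a Series whose labels match the columns
limits = pd.Series({'x': 2, 'y': 5})
below_limit = scores.lt(limits, axis='columns')
print(same_result)
print(below_limit)
```

The method form is mostly useful when you need the extra arguments; for a plain comparison the operator reads more naturally.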


In [65]:
df = pd.DataFrame({
    'one' : pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
    'two' : pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
    'three' : pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})


In [66]:
df2 = df.copy()


In [67]:
df.gt(df2)

Unnamed: 0,one,two,three
a,False,False,False
b,False,False,False
c,False,False,False
d,False,False,False


In [68]:
df2.ne(df)

Unnamed: 0,one,two,three
a,False,False,True
b,False,False,False
c,False,False,False
d,True,False,False


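The True entries above appear where both frames hold NaN: np.nan == np.nan returns False, so a NaN never compares equal to another NaN.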
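
A small sketch of this behaviour, assuming a made-up Series named `values`: comparing it with an identical copy flags exactly the positions that hold NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value
values = pd.Series([1.0, np.nan, 3.0])

# NaN is not equal to itself, so ne() flags it even against an identical copy
differs = values.ne(values.copy())

# isna() confirms that the flagged positions are exactly the NaN positions
nan_only = differs.equals(values.isna())
print(differs.tolist(), nan_only)
```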

You can apply the reductions empty, any(), all(), and bool() to summarize a boolean result.

In [69]:
(df>0).all()

one      False
two      False
three    False
dtype: bool

The expression `(df>0).all()` works in two steps:

1. `df > 0`: An element-wise comparison on the DataFrame `df`. It returns a new DataFrame of the same shape as `df`, with `True` where the original value is greater than 0 and `False` otherwise (including where the value is NaN).

2. `.all()`: This method is then called on the resulting DataFrame. It returns whether all elements are `True` over the requested axis. By default it reduces over the index axis (axis=0), so it returns `True` for a column only if every element in that column is `True`.

So `(df>0).all()` checks whether all elements in each column of `df` are greater than 0. The result is a Series (a labeled one-dimensional array) with the column labels as its index and `True` or `False` as values.

In [70]:
(df >0).any()

one       True
two       True
three    False
dtype: bool

In [71]:
(df>0).any().any()

True

The expression `(df>0).any().any()` reduces the DataFrame `df` to a single boolean in three steps:

1. `df > 0`: An element-wise comparison that returns a new DataFrame of the same shape as `df`, with `True` where the original value is greater than 0 and `False` otherwise.

2. `.any()`: Called on the resulting DataFrame, this method returns whether any element is `True` over the requested axis. By default it reduces over the index axis (axis=0): a column yields `True` if at least one of its elements is `True`, and `False` if all of them are `False`.

3. `.any()`: Called again, this time on the resulting Series, it returns `True` if at least one column yielded `True`, and `False` otherwise.

So `(df>0).any().any()` checks whether any value in `df` is greater than 0. The result is a single boolean value.



You might be tempted to do the following:

if df:
  pass

or

df and df2

Both raise a ValueError, because the truth value of an object that holds multiple values is ambiguous.
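
The following sketch shows the error and some explicit alternatives. The frame `flags` is a made-up example.

```python
import pandas as pd

flags = pd.DataFrame({'x': [True, False]})  # hypothetical frame

try:
    if flags:  # ambiguous: which of the two values should decide?
        pass
    raised = False
except ValueError as err:
    raised = True
    print(err)

# Explicit alternatives state what is actually being asked
is_empty = flags.empty          # does the frame have no elements?
any_true = flags.any().any()    # is at least one value True?
all_true = flags.all().all()    # is every value True?
print(is_empty, any_true, all_true)
```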


To evaluate single-element pandas objects in a boolean context, use the method bool():

In [72]:
pd.Series([True]).bool()

True

In [73]:
pd.Series([False]).bool()

False

In [74]:
pd.DataFrame([True]).bool()

True

In [75]:
pd.DataFrame([False]).bool()

False
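
Recent pandas releases mark bool() as deprecated (since version 2.1, to the best of our knowledge), so it may warn or be unavailable depending on your installation. A hedged sketch of two alternatives that do not rely on it:

```python
import pandas as pd

single = pd.Series([True])

# item() returns the only element as a scalar
# and raises ValueError if there is more than one element
as_scalar = single.item()

# For a one-cell DataFrame, positional access reaches the value
cell = pd.DataFrame([False]).iloc[0, 0]

print(as_scalar, bool(cell))
```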

## Objects comparison

You can conveniently perform element-wise comparisons when comparing a pandas data structure with a scalar value:

In [76]:
pd.Series(['foo','bar','baz']) == 'foo'

0     True
1    False
2    False
dtype: bool

Pandas also handles element-wise comparisons between different array-like objects of the same length:

In [77]:
pd.Series(['foo', 'bar', 'baz']) == pd.Index(['foo', 'bar', 'qux'])

0     True
1     True
2    False
dtype: bool

Trying to compare Index or Series objects of different lengths raises a ValueError:

In [78]:
pd.Series(['foo', 'bar', 'baz']) == pd.Series(['foo', 'bar'])

ValueError: Can only compare identically-labeled Series objects
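
If you want a result instead of an error, the eq() method aligns the two Series on their index before comparing. The sketch below contrasts the two behaviours; the variable names are illustrative.

```python
import pandas as pd

left = pd.Series(['foo', 'bar', 'baz'])
right = pd.Series(['foo', 'bar'])

# The == operator refuses to compare Series whose labels differ
try:
    left == right
    raised = False
except ValueError:
    raised = True

# eq() aligns on the index first; the unmatched label 2 compares as False
aligned = left.eq(right)
print(raised, aligned.tolist())
```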

Often you may find that there is more than one way to compute the same result. For example, consider df + df and df * 2. To test that these two computations produce the same result, given the tools shown above, you might imagine using (df + df == df * 2).all().all().

In [None]:
(df + df == df * 2).all()

one      False
two       True
three    False
dtype: bool

In [None]:
df + df == df * 2

Unnamed: 0,one,two,three
a,True,True,False
b,True,True,True
c,True,True,True
d,False,True,True


The comparison is False wherever df holds NaN because, as mentioned above, NaN does not compare equal to itself:

In [None]:
np.nan == np.nan

False

So, Pandas objects (such as Series and DataFrames) have an equals() method for testing equality, with NaNs in corresponding locations treated as equal.

In [None]:
(df+df).equals(df*2)


True

Note that the Series or DataFrame index needs to be in the same order for the equality to be True.
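
A minimal sketch of the ordering requirement, using two made-up frames that hold the same label/value pairs in a different row order:

```python
import numpy as np
import pandas as pd

first = pd.DataFrame({'x': [1.0, np.nan, 3.0]})
# Same label/value pairs, but stored in reverse order
second = pd.DataFrame({'x': [3.0, np.nan, 1.0]}, index=[2, 1, 0])

different_order = first.equals(second)             # False: the row order differs
after_sorting = first.equals(second.sort_index())  # True once the order matches
print(different_order, after_sorting)
```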

## Descriptive Statistics

There is a large number of methods for computing descriptive statistics and other related operations on Series and DataFrame. All of them are vectorized. Most of them are aggregations and produce a lower-dimensional result.

Generally speaking, these methods take an axis as an argument and the axis can be specified by name or integer:

In [None]:
df.mean(0)

one      0.285534
two     -0.330910
three   -0.169282
dtype: float64

In [None]:
df.mean(1)

a    0.067281
b   -0.946347
c    0.056331
d    0.780302
dtype: float64
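
The cells above pass the axis as an integer. As a small sketch on a made-up frame `grid`, the named forms 'index' and 'columns' give the same results as 0 and 1:

```python
import pandas as pd

grid = pd.DataFrame({'p': [1.0, 3.0], 'q': [5.0, 7.0]}, index=['r1', 'r2'])

# axis=0 and axis='index' both reduce down the rows: one value per column
col_means = grid.mean(axis='index')

# axis=1 and axis='columns' both reduce across the columns: one value per row
row_means = grid.mean(axis='columns')

print(col_means.equals(grid.mean(0)), row_means.equals(grid.mean(1)))
```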

By applying vectorized operations, we can describe various statistical procedures, like standardization (rendering data zero mean and standard deviation 1), very concisely:

In [None]:
ts_stand = (df - df.mean()) / df.std()

In [None]:
ts_stand.std()

one      1.0
two      1.0
three    1.0
dtype: float64
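
The same idea can be applied per row instead of per column. In that case plain arithmetic operators would align the row statistics with the column labels, so this sketch uses sub() and div() with an explicit axis. The frame `data` is a made-up example.

```python
import pandas as pd

data = pd.DataFrame([[1.0, 2.0, 3.0], [4.0, 6.0, 8.0]], columns=['u', 'v', 'w'])

# sub() and div() with axis=0 align the per-row statistics with the row labels
row_stand = data.sub(data.mean(axis=1), axis=0).div(data.std(axis=1), axis=0)

# Each row now has mean 0 and standard deviation 1
print(row_stand)
print(row_stand.std(axis=1))
```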

Here are some of the most commonly used methods in pandas for descriptive statistics:

1. `count()`: Number of non-null observations
2. `sum()`: Sum of values
3. `mean()`: Mean of values
4. `median()`: Arithmetic median of values
5. `mode()`: Mode of values
6. `std()`: Unbiased standard deviation
7. `min()`: Minimum
8. `max()`: Maximum
9. `abs()`: Absolute Value
10. `prod()`: Product of values
11. `cumsum()`: Cumulative sum
12. `cumprod()`: Cumulative product
13. `quantile([0.25,0.75])`: Returns the 25th and 75th percentile
14. `describe()`: Summary statistics

These methods are typically called on pandas DataFrame or Series objects, like so: `df['column_name'].mean()`.
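
To make a few of these concrete, here is a short sketch on a made-up Series `obs` that contains one missing value. It relies on the default behaviour of skipping NaN.

```python
import numpy as np
import pandas as pd

obs = pd.Series([1.0, 2.0, 3.0, 4.0, np.nan])

n_valid = obs.count()                   # the NaN is not counted
total = obs.sum()                       # NaN is skipped by default
middle = obs.median()
quartiles = obs.quantile([0.25, 0.75])  # linear interpolation by default
running = obs.cumsum()                  # the NaN position stays NaN
print(n_valid, total, middle)
print(quartiles.tolist())
print(running.tolist())
```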

## Describe

There is a convenient describe() function which computes a variety of summary statistics about a Series or the columns of a DataFrame:

In [None]:
series = pd.Series(np.random.randn(1000))

In [None]:
series[::2] = np.nan

In [None]:
series.describe()

count    500.000000
mean      -0.045837
std        1.003211
min       -2.400862
25%       -0.778256
50%       -0.129547
75%        0.687897
max        3.313279
dtype: float64

In [None]:
frame = pd.DataFrame(np.random.randn(1000, 5),
                     columns = ['a', 'b', 'c', 'd', 'e'])

In [None]:
frame.iloc[::2] = np.nan


In [None]:
frame.describe()

Unnamed: 0,a,b,c,d,e
count,500.0,500.0,500.0,500.0,500.0
mean,0.028623,-0.037166,-0.015769,0.08692,0.078429
std,1.027404,1.028788,1.011862,0.988492,0.937917
min,-2.882037,-2.69205,-3.008317,-3.034981,-3.023413
25%,-0.748024,-0.68747,-0.707117,-0.520083,-0.541479
50%,0.003569,-0.076171,-0.007183,0.085294,0.052635
75%,0.742505,0.60272,0.637042,0.764683,0.761755
max,3.314278,2.905741,3.02851,3.113863,3.136734


For a non-numerical Series object, describe() will give a simple summary of the number of unique values and the most frequently occurring values:

In [None]:
s = pd.Series(['a', 'a', 'b', 'b', 'a', 'a', np.nan, 'c', 'd', 'a'])

In [None]:
s.describe()

count     9
unique    4
top       a
freq      5
dtype: object
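
For numerical data you can also choose which percentiles describe() reports through its percentiles argument. A minimal sketch, assuming a made-up Series `ramp` holding the numbers 0 to 100:

```python
import pandas as pd

ramp = pd.Series(range(101), dtype='float64')  # hypothetical data: 0, 1, ..., 100

# percentiles selects which quantiles appear in the summary
summary = ramp.describe(percentiles=[0.05, 0.5, 0.95])
print(summary)
```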

## Index of Min/Max Values

The idxmin() and idxmax() functions on Series and DataFrame compute the index labels with the minimum and maximum corresponding values:

In [None]:
s1 = pd.Series(np.random.randn(5))

In [None]:
s1

0   -0.433074
1    0.910450
2   -0.162152
3    0.349457
4    0.777279
dtype: float64

In [None]:
s1.idxmin(), s1.idxmax()

(0, 1)

In [None]:
df1 = pd.DataFrame(np.random.randn(5, 3), columns=['A', 'B', 'C'])

In [None]:
df1


Unnamed: 0,A,B,C
0,-0.553053,-0.401219,0.783614
1,0.99947,1.827009,-2.645911
2,-0.500194,0.052038,1.206976
3,-0.943975,1.025752,-0.191107
4,0.891964,1.142942,-1.886593


In [None]:
df1.idxmin(axis=0)

A    3
B    0
C    1
dtype: int64

In [None]:
df1.idxmax(axis=1)

0    C
1    B
2    C
3    B
4    B
dtype: object
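
When several entries share the minimum or maximum value, idxmin() and idxmax() return the first matching label. A small sketch with a made-up Series `tied`:

```python
import pandas as pd

tied = pd.Series([2, 1, 1, 3], index=['a', 'b', 'c', 'd'])

# Both 'b' and 'c' hold the minimum value 1; the first label is returned
first_min = tied.idxmin()
print(first_min)
```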

## Iterations

The behaviour of basic iteration over pandas objects depends on the type. A Series is regarded as array-like, so basic iteration produces its values. DataFrames follow the dict-like convention of iterating over the keys of the object.

In short, basic iteration (for i in object) produces:

    Series: values
    DataFrame: column labels
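
A quick sketch of both cases, using made-up objects `ser` and `frame_demo`:

```python
import pandas as pd

ser = pd.Series([10, 20], index=['a', 'b'])
frame_demo = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})

series_items = [item for item in ser]        # the values, not the index labels
frame_items = [item for item in frame_demo]  # the column labels
print(series_items, frame_items)
```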


In [None]:
df = pd.DataFrame({'col1':np.random.rand(3),
                   'col2':np.random.rand(3)},
                  index=['a', 'b', 'c'])

In [None]:
for col in df:
    print(col)

col1
col2


To iterate over a DataFrame, you can use the following methods:

    items(): Iterate over the (column label, Series) pairs.
    iterrows(): Iterate over the rows of a DataFrame as (index, Series) pairs. This converts the rows to Series objects, which can change the dtypes and has some performance implications.
    itertuples(): Iterate over the rows of a DataFrame as namedtuples of the values. This is a lot faster than iterrows() and is in most cases the preferable way to iterate over the values of a DataFrame.





Iterating through Pandas objects is generally slow. In many cases, iterating manually over the rows is not needed and can be avoided.
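
As an illustration of avoiding the loop, the sketch below computes the same total twice: once row by row and once with a vectorized expression. The frame `orders` and its columns are made up.

```python
import pandas as pd

orders = pd.DataFrame({'price': [2.0, 3.5, 1.25], 'qty': [3, 2, 4]})

# Row-by-row loop: works, but builds a Series for every row
loop_total = 0.0
for _, row in orders.iterrows():
    loop_total += row['price'] * row['qty']

# Vectorized: one multiplication over whole columns, then a sum
vector_total = (orders['price'] * orders['qty']).sum()
print(loop_total, vector_total)
```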

Warning

You should never modify something you are iterating over. This is not guaranteed to work in all cases.
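
One way this goes wrong: depending on the data types, the iterator hands you a copy and not a view, so writing to it has no effect. A minimal sketch with a made-up mixed-dtype frame:

```python
import pandas as pd

mixed = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})

for _, row in mixed.iterrows():
    row['a'] = 10  # writes to the per-row Series, which is a copy here

unchanged = mixed['a'].tolist()  # the DataFrame itself is not modified
print(unchanged)
```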


## items

Consistent with the dict-like interface, items() iterates through key-value pairs:

    Series: (index, scalar value) pairs
    DataFrame: (column, Series) pairs

For example:

In [None]:
df = pd.DataFrame({'a':[1, 2, 3], 'b':[4, 5, 6]})

In [None]:
for label, serie in df.items():
    print(label)
    print(serie)

a
0    1
1    2
2    3
Name: a, dtype: int64
b
0    4
1    5
2    6
Name: b, dtype: int64


## iterrows

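iterrows() allows you to iterate through the rows of a DataFrame as Series objects. It returns an iterator yielding each index value along with a Series containing the data in each row. The output in this section and the next comes from the df created in the Boolean Comparisons section (columns one, two, and three), not from the two-column df defined under items: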

In [79]:
for row_index, row in df.iterrows():
    print(row_index, row, sep='\n')

a
one      1.151562
two      0.345657
three         NaN
Name: a, dtype: float64
b
one      0.853684
two      0.034525
three   -1.300020
Name: b, dtype: float64
c
one      0.447872
two     -1.222005
three   -0.534410
Name: c, dtype: float64
d
one           NaN
two     -2.130274
three   -1.218212
Name: d, dtype: float64
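
Because each row is returned as a single Series, iterrows() does not preserve dtypes across the row. The sketch below, on a made-up one-row frame `typed`, shows an integer column being upcast to float.

```python
import pandas as pd

typed = pd.DataFrame([[1, 1.5]], columns=['int', 'float'])

_, first_row = next(typed.iterrows())

column_dtype = typed['int'].dtype  # the column keeps its integer dtype
row_dtype = first_row.dtype        # the row Series is upcast to hold both values
print(column_dtype, row_dtype)
```

If the original dtypes matter, itertuples() is usually the safer choice.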


## itertuples

The itertuples() method will return an iterator yielding a namedtuple for each row in the DataFrame. The first element of the tuple will be the row’s corresponding index value, while the remaining values are the row values.

For example:

In [80]:
for row in df.itertuples():
    print(row)

Pandas(Index='a', one=1.151562427718633, two=0.34565699025993685, three=nan)
Pandas(Index='b', one=0.8536836554669829, two=0.03452493860004058, three=-1.3000196757936338)
Pandas(Index='c', one=0.44787227142306624, two=-1.2220051635035025, three=-0.5344100105714127)
Pandas(Index='d', one=nan, two=-2.1302735218749396, three=-1.218211830987182)
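
The fields of each namedtuple can be read by attribute name. The sketch below also uses the index and name arguments of itertuples() to get plain tuples without the index. The frame `points` is a made-up example.

```python
import pandas as pd

points = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}, index=['p', 'q'])

# Fields of the namedtuple are available by attribute name
labels_and_sums = [(row.Index, row.a + row.b) for row in points.itertuples()]

# index=False drops the index field; name=None yields plain tuples
plain = list(points.itertuples(index=False, name=None))
print(labels_and_sums)
print(plain)
```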
