# Basic stats

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])
df

    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3


## Sum

Calling DataFrame’s sum method returns a Series containing column sums:


In [3]:
df.sum()

one    9.25
two   -5.80
dtype: float64

Passing axis='columns' or axis=1 sums across the columns instead:

In [4]:
df.sum(axis='columns')

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64
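
Note that row c, whose values are all NA, sums to 0.0 rather than NaN. As a minimal sketch, assuming you would rather keep an all-NA row as NA, the min_count parameter of sum requires at least that many non-NA values before a sum is reported:

```python
import numpy as np
import pandas as pd

# Same frame as above, rebuilt so the snippet runs on its own
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])

# Default: NA values are skipped, so the all-NA row 'c' sums to 0.0
row_sums = df.sum(axis='columns')

# min_count=1 requires at least one non-NA value; row 'c' becomes NaN
row_sums_strict = df.sum(axis='columns', min_count=1)
```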

## Mean 

The mean is the average of a set of values. It is simple to compute: sum
the values and divide by the number of values. The mean is useful because it
shows where the “center of gravity” of an observed set of values lies.

In [5]:
sample = [1, 3, 2, 5, 7, 0, 2, 3, 2]
sm = pd.Series(sample)

In [6]:
sm.mean()

2.7777777777777777
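
The "sum and divide" description can be checked by hand. This is a small sketch that rebuilds the same sample; manual_mean and matches are names introduced here for illustration:

```python
import pandas as pd

sample = [1, 3, 2, 5, 7, 0, 2, 3, 2]
sm = pd.Series(sample)

# Mean by hand: total of the values divided by how many there are (25 / 9)
manual_mean = sum(sample) / len(sample)

# Compare with pandas, allowing for floating-point rounding
matches = abs(manual_mean - sm.mean()) < 1e-12
```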

NA values are excluded unless the entire slice (row or column in this case) is NA.
This can be disabled with the skipna option:

In [7]:
df.mean(axis='columns', skipna=False)

a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64

## Median 

The median is the middle-most value in a set of ordered values.
You sort the values, and the median is the center-most value (or the average of
the two center-most values when the count is even).
The median can be a helpful alternative to the mean when data is skewed by
outliers, that is, values that are extremely large or small compared to the rest
of the values.

In [8]:
sm.median()

2.0
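
To see why the median resists outliers, here is a hedged illustration that appends one extreme value to the same sample. The value 1000 is made up for this example:

```python
import pandas as pd

sample = [1, 3, 2, 5, 7, 0, 2, 3, 2]
sm = pd.Series(sample)

# Append one extreme value (hypothetical, for illustration only)
with_outlier = pd.Series(sample + [1000])

# The mean jumps by a large amount, while the median barely moves
mean_shift = with_outlier.mean() - sm.mean()
median_shift = with_outlier.median() - sm.median()
```

With ten values, the median is the average of the fifth and sixth sorted values, so it moves from 2.0 to 2.5 while the mean rises to 102.5.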

## Mode

The mode is the value, or set of values, that occurs most frequently. It is
mainly useful when your data is repetitive and you want to find which values
occur most often.
When no value occurs more than once, there is no mode. When two values occur
with equal frequency, the dataset is considered bimodal.

In [9]:
sm.mode()

0    2
dtype: int64
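
Because several values can tie, mode returns a Series rather than a single number. A short sketch with made-up data shows the bimodal case, and also what pandas does when nothing repeats: it returns every value, since they all tie at one occurrence.

```python
import pandas as pd

# Two values tie for most frequent, so mode() returns both
bimodal = pd.Series([1, 1, 2, 2, 3]).mode()

# No value repeats: pandas returns every value, in sorted order
no_repeat = pd.Series([3, 1, 2]).mode()
```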

## Variance 

In describing data, we are often interested in how far each data point lies from
the mean. This gives us a sense of how “spread out” the data is.

We square these differences, sum them, and then average the squared differences.
The result is the variance. For the sample variance, which is what var computes
by default, the sum is divided by n - 1 rather than n.

In [10]:
# Sample variance of values
sm.var()
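
The calculation described above can be sketched by hand on the sample from the Mean section. The names squared_diffs, sample_var and population_var are introduced here for illustration. By default, var uses ddof=1 (divide by n - 1), and ddof=0 gives the population variance instead:

```python
import pandas as pd

sm = pd.Series([1, 3, 2, 5, 7, 0, 2, 3, 2])

# Squared difference between each value and the mean
squared_diffs = (sm - sm.mean()) ** 2

# Sample variance: divide by n - 1 (pandas' default, ddof=1)
sample_var = squared_diffs.sum() / (len(sm) - 1)

# Population variance: divide by n (equivalent to sm.var(ddof=0))
population_var = squared_diffs.sum() / len(sm)
```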

## Standard deviation

The square root undoes squaring, so taking the square root of the variance gives
us the standard deviation, which is expressed in the same units as the data.

In [None]:
# Sample standard deviation of values
sm.std()
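
As a quick check, assuming the same sample used in the Mean section, the square root of the sample variance should agree with std up to floating-point rounding:

```python
import numpy as np
import pandas as pd

sm = pd.Series([1, 3, 2, 5, 7, 0, 2, 3, 2])

# Standard deviation is the square root of the variance
std_from_var = np.sqrt(sm.var())
```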

## Other methods

Some methods, like idxmin and idxmax, return indirect statistics like the index value
where the minimum or maximum values are attained:

In [None]:
df.idxmax()

Other methods are accumulations:

In [None]:
# Cumulative sum of values
df.cumsum()

Another type of method is neither a reduction nor an accumulation. describe is one
such example, producing multiple summary statistics in one shot:

In [None]:
df.describe()

On non-numeric data, describe produces alternative summary statistics:

In [None]:
obj = pd.Series(['a', 'a', 'b', 'c'] * 4)

In [None]:
obj.describe()

## Other descriptive and summary statistics

In [None]:
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Edward'],
    'Age': [24, 27, 22, 32, 29],
    'Pay': [200, 600, 150, 1200, 800],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix']
}
df = pd.DataFrame(data)

In [None]:
# Product of all values
df['Age'].prod()

In [None]:
# Sample skewness (third moment) of values
df['Age'].skew() 

In [None]:
# Sample kurtosis (fourth moment) of values 
df['Age'].kurt() 

In [None]:
# Cumulative maximum
df['Age'].cummax() 

In [None]:
# Cumulative minimum
df['Age'].cummin()

In [None]:
# Compute percent changes
df['Age'].pct_change()

In [None]:
# Compute first arithmetic difference (useful for time series)
df['Age'].diff()
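
diff and pct_change are closely related: the percent change is the first difference divided by the previous value. A minimal sketch on the same Age values, rebuilt here as a standalone Series so the snippet runs on its own:

```python
import pandas as pd

age = pd.Series([24, 27, 22, 32, 29])

# diff(): each value minus the previous one; the first entry is NaN
step = age.diff()

# pct_change(): that difference relative to the previous value
relative = step / age.shift(1)
```

For the second entry, the difference is 27 - 24 = 3 and the percent change is 3 / 24 = 0.125.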

In [None]:
# Cumulative product of values
df['Age'].cumprod() 

# Correlation and Covariance

The corr method of Series computes the correlation of the overlapping, non-NA,
aligned-by-index values in two Series. Relatedly, cov computes the covariance:

In [None]:
df['Age'].corr(df['Pay'])

In [None]:
df['Age'].cov(df['Pay'])
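
The two quantities are linked: the (Pearson) correlation, which is the default method for corr, is the covariance divided by the product of the two standard deviations. A small sketch that rebuilds the two columns as standalone Series:

```python
import pandas as pd

age = pd.Series([24, 27, 22, 32, 29])
pay = pd.Series([200, 600, 150, 1200, 800])

# Pearson correlation = covariance / (std of x * std of y)
manual_corr = age.cov(pay) / (age.std() * pay.std())
```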

DataFrame’s corr and cov methods, on the other hand, return a full correlation or
covariance matrix as a DataFrame, respectively:

In [None]:
data = {'Age': [24, 27, 22, 32, 29],
        'Pay': [200, 600, 150, 1200, 800],}
df = pd.DataFrame(data)

In [None]:
df.corr()

In [None]:
df.cov()

Using DataFrame’s corrwith method, you can compute pairwise correlations
between a DataFrame’s columns or rows with another Series or DataFrame. Passing a
Series returns a Series with the correlation value computed for each column:

In [None]:
df.corrwith(df['Age'])

Passing axis='columns' does things row-by-row instead. In all cases, the data points
are aligned by label before the correlation is computed.
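
A hedged sketch of the row-by-row case, using two small made-up frames (left and right are hypothetical names, not part of the data above). Each row of left is correlated with the row of right that has the same label:

```python
import pandas as pd

# Two small frames with matching row and column labels (made-up data)
left = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['x', 'y', 'z'])
right = pd.DataFrame([[2, 4, 6], [6, 5, 4]], columns=['x', 'y', 'z'])

# Correlate row 0 of left with row 0 of right, row 1 with row 1, and so on
row_corr = left.corrwith(right, axis='columns')
```

Row 0 of right is exactly twice row 0 of left, so the correlation is 1; row 1 of right decreases as row 1 of left increases, so the correlation is -1.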

# Unique Values, Value Counts, and Membership

In [None]:
obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

In [None]:
# gives you an array of the unique values in a Series:
uniques = obj.unique()
uniques

In [None]:
# The unique values are not necessarily returned in sorted order, but could be sorted
# after the fact if needed
uniques.sort()
uniques

In [None]:
# value_counts computes a Series containing value frequencies:
obj.value_counts()

isin performs a vectorized set membership check and can be useful in filtering a
dataset down to a subset of values in a Series or column in a DataFrame:

In [None]:
mask = obj.isin(['b', 'c'])
mask

In [None]:
obj[mask]
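
The same pattern applies to a column in a DataFrame. This is an illustrative sketch with a made-up frame; people and in_subset are names introduced here:

```python
import pandas as pd

# Made-up frame for illustration
people = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                       'City': ['New York', 'Chicago', 'Houston']})

# Keep rows whose City is in the allowed set; prefix the mask with ~ to invert it
in_subset = people[people['City'].isin(['Chicago', 'Houston'])]
```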

In [None]:
# get_indexer maps each (possibly repeated) value in to_match to its position
# in the array of distinct values
to_match = pd.Series(['c', 'a', 'b', 'b', 'c', 'a'])
unique_vals = pd.Series(['c', 'b', 'a'])
pd.Index(unique_vals).get_indexer(to_match)
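
One detail worth knowing: values that do not appear in the index are reported as -1. A minimal sketch, where 'z' is a made-up value that is deliberately absent:

```python
import pandas as pd

unique_vals = pd.Index(['c', 'b', 'a'])

# 'z' does not appear in unique_vals, so its position is reported as -1
positions = unique_vals.get_indexer(['c', 'a', 'z'])
```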

In some cases, you may want to compute a histogram on multiple related columns in
a DataFrame. Here’s an example:

In [None]:
data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})
data

In [None]:
# Applying value_counts to each column gives the frequency of each value per column;
# values that never occur in a column are filled with 0
result = data.apply(pd.Series.value_counts).fillna(0)
result