# Basic stats

In [2]:
import pandas as pd
import numpy as np

In [64]:
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])
df

    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3


## Sum

Calling DataFrame’s sum method returns a Series containing column sums:


In [4]:
df.sum()

one    9.25
two   -5.80
dtype: float64

Passing axis='columns' or axis=1 sums across the columns instead:

In [5]:
df.sum(axis='columns')

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64
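
Row c contains only NA values, yet its sum is reported as 0 rather than NaN.
As a small sketch (the frame is rebuilt under the assumed name `frame` so the
snippet runs on its own), the `min_count` argument of `sum` can require a
minimum number of non-NA values before a result is produced:

```python
import numpy as np
import pandas as pd

# Same values as the DataFrame above, rebuilt so this snippet is self-contained
frame = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                      [np.nan, np.nan], [0.75, -1.3]],
                     index=['a', 'b', 'c', 'd'],
                     columns=['one', 'two'])

# Default behaviour: NA values are skipped, so the all-NA row 'c' sums to 0.0
row_sums = frame.sum(axis='columns')

# min_count=1 requires at least one non-NA value, so row 'c' becomes NaN
row_sums_strict = frame.sum(axis='columns', min_count=1)
print(row_sums_strict)
```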

## Mean 

The mean is the average of a set of values. The operation is simple to do: sum
the values and divide by the number of values. The mean is useful because it
shows where the “center of gravity” exists for an observed set of values.

In [60]:
sample = [1, 3, 2, 5, 7, 0, 2, 3, 2]
sm = pd.Series(sample)

In [61]:
sm.mean()

2.7777777777777777
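
To make the "sum the values and divide by the number of values" description
concrete, here is a minimal sketch that computes the same mean by hand and
compares it with the pandas result (the names `manual_mean` and `pandas_mean`
are only illustrative):

```python
import pandas as pd

sample = [1, 3, 2, 5, 7, 0, 2, 3, 2]

# Sum the values and divide by how many there are
manual_mean = sum(sample) / len(sample)

# The pandas result should agree with the manual calculation
pandas_mean = pd.Series(sample).mean()
print(manual_mean, pandas_mean)
```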

NA values are excluded unless the entire slice (row or column in this case) is NA.
This can be disabled with the skipna option:

In [65]:
df.mean(axis='columns', skipna=False)

a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64

## Median 

The median is the middle-most value in a set of ordered values.
You sort the values, and the median is the center-most value (or the average of
the two center values when the number of values is even).
The median can be a helpful alternative to the mean when data is skewed by
outliers, that is, values that are extremely large or small compared to the rest
of the values.

In [66]:
sm.median()

2.0
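
The following sketch illustrates both points: finding the median by sorting, and
how a single outlier affects the mean far more than the median. The value 1000
is a made-up outlier added only for illustration:

```python
import pandas as pd

sample = [1, 3, 2, 5, 7, 0, 2, 3, 2]

# Order the values and pick the middle one (the count is odd here)
ordered = sorted(sample)
manual_median = ordered[len(ordered) // 2]

# Hypothetical outlier appended to show the effect on each statistic
with_outlier = pd.Series(sample + [1000])
print(manual_median)
print(with_outlier.mean(), with_outlier.median())
```

With the outlier included, the mean jumps from about 2.78 to 102.5, while the
median only moves from 2.0 to 2.5.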

## Mode

The mode is the most frequently occurring value (or set of values). It primarily
becomes useful when your data is repetitive and you want to find which values
occur the most frequently.
When no value occurs more than once, there is no mode in the usual sense. When
two values tie for the highest frequency, the dataset is considered bimodal.

In [67]:
sm.mode()

0    2
dtype: int64
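
Because `mode` returns a Series rather than a single number, it can report
ties. The sketch below uses two small made-up Series to show the bimodal case
and the case where no value repeats; in the latter, pandas returns every value
instead of reporting "no mode":

```python
import pandas as pd

# Two values tie for most frequent: mode() returns both, in sorted order
bimodal = pd.Series([1, 1, 2, 3, 3])
bimodal_modes = bimodal.mode().tolist()
print(bimodal_modes)

# No value repeats: every value is returned as a mode
no_repeats = pd.Series([4, 5, 6])
no_repeat_modes = no_repeats.mode().tolist()
print(no_repeat_modes)
```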

## Variance 

In describing data, we are often interested in measuring the differences between
the mean and every data point. This gives us a sense of how “spread out” the
data is.

We square these differences, sum them, and then average the squared differences.
This gives us the variance. For a sample, pandas divides the sum by n - 1 rather
than n, which is why the result below is called the sample variance.

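In [19]:
# Sample variance of the Age values (the same ages as the Age column
# built in the "Other Descriptive and summary statistics" section below)
ages = pd.Series([24, 27, 22, 32, 29], name='Age')
ages.var()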

15.7
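
As a rough sketch of the calculation described above, the variance can be
reproduced step by step. The `ddof` argument of `var` controls the divisor:
the pandas default `ddof=1` divides by n - 1 (sample variance), while `ddof=0`
divides by n (population variance). Variable names here are illustrative:

```python
import pandas as pd

ages = pd.Series([24, 27, 22, 32, 29])

# Squared differences between each value and the mean
squared_diffs = (ages - ages.mean()) ** 2

# Sample variance divides by n - 1 (the pandas default, ddof=1)
sample_var = squared_diffs.sum() / (len(ages) - 1)

# Population variance divides by n (ddof=0)
population_var = squared_diffs.sum() / len(ages)

print(sample_var, ages.var())
print(population_var, ages.var(ddof=0))
```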

## Standard deviation

The opposite of a square is a square root, so let’s take the square root of the
variance, which gives us the standard deviation. Unlike the variance, it is
expressed in the same units as the original data.

In [20]:
# Sample standard deviation of the same Age values
ages = pd.Series([24, 27, 22, 32, 29], name='Age')
ages.std()

3.96232255123179
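
A quick check of that relationship, as a minimal sketch using the same ages:
taking the square root of the sample variance should reproduce the value
returned by `std`.

```python
import numpy as np
import pandas as pd

ages = pd.Series([24, 27, 22, 32, 29])

# Square root of the sample variance
std_from_var = np.sqrt(ages.var())

# Should match the built-in sample standard deviation
print(std_from_var, ages.std())
```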

## Other methods

Some methods, like idxmin and idxmax, return indirect statistics like the index value
where the minimum or maximum values are attained:

In [8]:
df.idxmax()

one    b
two    d
dtype: object

Other methods are accumulations:

In [9]:
# Cumulative sum of values
df.cumsum()

    one  two
a  1.40  NaN
b  8.50 -4.5
c   NaN  NaN
d  9.25 -5.8


Another type of method is neither a reduction nor an accumulation. describe is one
such example, producing multiple summary statistics in one shot:

In [10]:
df.describe()

            one       two
count  3.000000  2.000000
mean   3.083333 -2.900000
std    3.493685  2.262742
min    0.750000 -4.500000
25%    1.075000 -3.700000
50%    1.400000 -2.900000
75%    4.250000 -2.100000
max    7.100000 -1.300000


On non-numeric data, describe produces alternative summary statistics:

In [11]:
obj = pd.Series(['a', 'a', 'b', 'c'] * 4)

In [12]:
obj.describe()

count     16
unique     3
top        a
freq       8
dtype: object

## Other Descriptive and summary statistics

In [29]:
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Edward'],
    'Age': [24, 27, 22, 32, 29],
    'Pay': [200, 600, 150, 1200, 800],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix']
}
df = pd.DataFrame(data)

In [18]:
# Product of all values
df['Age'].prod()

13229568

In [21]:
# Sample skewness (third moment) of values
df['Age'].skew() 

0.12538486713297808

In [22]:
# Sample kurtosis (fourth moment) of values 
df['Age'].kurt() 

-1.1696214856586469

In [23]:
# Cumulative maximum
df['Age'].cummax() 

0    24
1    27
2    27
3    32
4    32
Name: Age, dtype: int64

In [24]:
# Cumulative minimum
df['Age'].cummin()

0    24
1    24
2    22
3    22
4    22
Name: Age, dtype: int64

In [25]:
# Compute percent changes
df['Age'].pct_change()

0         NaN
1    0.125000
2   -0.185185
3    0.454545
4   -0.093750
Name: Age, dtype: float64

In [26]:
# Compute first arithmetic difference (useful for time series)
df['Age'].diff()

0     NaN
1     3.0
2    -5.0
3    10.0
4    -3.0
Name: Age, dtype: float64
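
The two methods above are closely related. As a small sketch (names are
illustrative), the percent change can be rebuilt from the first difference by
dividing it by the previous value, obtained here with `shift`:

```python
import numpy as np
import pandas as pd

ages = pd.Series([24, 27, 22, 32, 29])

# First difference: each value minus the previous one
diffs = ages.diff()

# Percent change: the difference divided by the previous value
manual_pct = diffs / ages.shift()

# Compare with pandas (the first entry is NaN in both)
print(np.allclose(manual_pct, ages.pct_change(), equal_nan=True))
```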

In [27]:
# Cumulative product of values
df['Age'].cumprod() 

0          24
1         648
2       14256
3      456192
4    13229568
Name: Age, dtype: int64

# Correlation and Covariance

The corr method of Series computes the correlation of the overlapping, non-NA,
aligned-by-index values in two Series. Relatedly, cov computes the covariance:

In [31]:
df['Age'].corr(df['Pay'])

0.9887779982817946

In [32]:
df['Age'].cov(df['Pay'])

1710.0
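
The two results are linked: the (Pearson) correlation is the covariance divided
by the product of the two standard deviations. A minimal sketch of that check,
with the Age and Pay values copied into standalone Series:

```python
import pandas as pd

age = pd.Series([24, 27, 22, 32, 29])
pay = pd.Series([200, 600, 150, 1200, 800])

# Correlation = covariance / (std of age * std of pay)
manual_corr = age.cov(pay) / (age.std() * pay.std())

# Should agree with the built-in corr method
print(manual_corr, age.corr(pay))
```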

DataFrame’s corr and cov methods, on the other hand, return a full correlation or
covariance matrix as a DataFrame, respectively:

In [36]:
data = {'Age': [24, 27, 22, 32, 29],
        'Pay': [200, 600, 150, 1200, 800],}
df = pd.DataFrame(data)

In [37]:
df.corr()

          Age       Pay
Age  1.000000  0.988778
Pay  0.988778  1.000000


In [38]:
df.cov()

        Age       Pay
Age    15.7    1710.0
Pay  1710.0  190500.0

Using DataFrame’s corrwith method, you can compute pairwise correlations
between a DataFrame’s columns or rows with another Series or DataFrame. Passing a
Series returns a Series with the correlation value computed for each column:

In [39]:
df.corrwith(df['Age'])

Age    1.000000
Pay    0.988778
dtype: float64

Passing axis='columns' does things row-by-row instead. In all cases, the data points
are aligned by label before the correlation is computed.
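
As an illustration of the row-by-row case, the sketch below correlates matching
rows of two small frames. The frames `left` and `right` and their values are
made up for this example:

```python
import pandas as pd

# Hypothetical data: three measurements per row in each frame
left = pd.DataFrame({'x': [1.0, 3.0], 'y': [2.0, 2.0], 'z': [3.0, 1.0]},
                    index=['r1', 'r2'])
right = pd.DataFrame({'x': [2.0, 1.0], 'y': [4.0, 2.0], 'z': [6.0, 3.0]},
                     index=['r1', 'r2'])

# Correlate each row of `left` with the row of `right` that has the same label
row_corr = left.corrwith(right, axis='columns')
print(row_corr)
```

Row r1 rises in step in both frames, so its correlation is 1; row r2 falls in
`left` while rising in `right`, so its correlation is -1.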

# Unique Values, Value Counts, and Membership

In [40]:
obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

In [41]:
# unique gives you an array of the unique values in a Series:
uniques = obj.unique()
uniques

array(['c', 'a', 'd', 'b'], dtype=object)

In [44]:
# The unique values are not necessarily returned in sorted order, but could be sorted
# after the fact if needed
uniques.sort()
uniques

array(['a', 'b', 'c', 'd'], dtype=object)

In [45]:
# value_counts computes a Series containing value frequencies:
obj.value_counts()

c    3
a    3
b    2
d    1
Name: count, dtype: int64
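
If relative frequencies are more useful than raw counts, `value_counts` accepts
`normalize=True`. A short sketch using the same values (the name `shares` is
illustrative):

```python
import pandas as pd

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

# Fractions of the total instead of raw counts; they sum to 1
shares = obj.value_counts(normalize=True)
print(shares)
```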

isin performs a vectorized set membership check and can be useful in filtering a
dataset down to a subset of values in a Series or column in a DataFrame:

In [46]:
mask = obj.isin(['b', 'c'])
mask

0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool

In [47]:
obj[mask]

0    c
5    b
6    b
7    c
8    c
dtype: object
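
The same pattern works on a DataFrame column. The sketch below reuses the names
and cities from earlier in a small frame called `people` (an illustrative name)
and keeps only the rows whose City is in a given list:

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Edward'],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'],
})

# Boolean mask: True where City is one of the listed values
city_mask = people['City'].isin(['Chicago', 'Houston'])

# Keep the matching rows; ~city_mask would keep the others instead
selected = people[city_mask]
print(selected)
```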

In [48]:
# Index.get_indexer gives you an array of positions: for each value in to_match
# (possibly non-distinct), its position in the array of distinct values
to_match = pd.Series(['c', 'a', 'b', 'b', 'c', 'a'])
unique_vals = pd.Series(['c', 'b', 'a'])
pd.Index(unique_vals).get_indexer(to_match)

array([0, 2, 1, 1, 0, 2], dtype=int64)
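
One detail worth knowing is what happens when a value has no match. In this
small sketch, 'z' is a made-up value that is absent from the distinct values,
and `get_indexer` marks it with -1:

```python
import pandas as pd

unique_vals = pd.Index(['c', 'b', 'a'])
to_match = pd.Series(['c', 'a', 'z'])

# Values not found in the index are reported as -1
positions = unique_vals.get_indexer(to_match)
print(positions)
```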

In some cases, you may want to compute a histogram on multiple related columns in
a DataFrame. Here’s an example:

In [49]:
data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})
data

   Qu1  Qu2  Qu3
0    1    2    1
1    3    3    5
2    4    1    2
3    3    2    4
4    4    3    4


In [50]:
# Passing pandas.Series.value_counts to this DataFrame’s apply function gives the
# count of each value per column; values absent from a column are filled with 0
result = data.apply(pd.Series.value_counts).fillna(0)
result


   Qu1  Qu2  Qu3
1  1.0  1.0  1.0
2  0.0  2.0  1.0
3  2.0  2.0  0.0
4  2.0  0.0  2.0
5  0.0  0.0  1.0
