In [2]:
import numpy as np
import pandas as pd

**Summarizing and Computing Descriptive Statistics**

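Pandas objects have a set of common mathematical and statistical methods. Most of them are *reductions* or *summary statistics*: methods that extract a single value (such as the sum or mean) from a Series, or a Series of values from the rows or columns of a DataFrame. They also have built-in handling for missing data.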

In [3]:
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]], index=list('abcd'),  columns=['one', 'two'])
df

,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3


Calling the *sum* method returns a Series that contains the sum of each column.

In [4]:
df.sum()

one    9.25
two   -5.80
dtype: float64

Passing *axis='columns'* (or *axis=1*) sums across the columns instead, giving one value per row.

In [5]:
df.sum(axis='columns')

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

By default, NaN values are skipped when summing, so a row or column that contains only NaN values sums to 0. Passing *skipna=False* disables this: if any value in a row or column is NaN, its sum is NaN.

In [6]:
df.sum(axis='index', skipna=False)

one   NaN
two   NaN
dtype: float64

Setting *skipna=True*, which is the default, excludes the NaN values again.

In [7]:
df.sum(axis='index', skipna=True)

one    9.25
two   -5.80
dtype: float64
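
The interaction between NaN values and *sum* can be seen side by side in the sketch below. It rebuilds the same `df` and also uses the `min_count` parameter of `sum`, which does not appear in the cells above, to get NaN instead of 0 for an all-NaN row.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=list("abcd"),
    columns=["one", "two"],
)

# Default: NaN values are skipped, so the all-NaN row "c" sums to 0.0
row_default = df.sum(axis="columns")

# skipna=False: a single NaN in a row makes that row's sum NaN
row_strict = df.sum(axis="columns", skipna=False)

# min_count=1: require at least one valid value, otherwise return NaN
row_min1 = df.sum(axis="columns", min_count=1)

print(pd.DataFrame({"default": row_default, "strict": row_strict, "min_count=1": row_min1}))
```

Whether an all-NaN row should count as 0 or as missing depends on the analysis, so `min_count` is worth considering whenever empty rows are possible.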

Some aggregations, such as *mean*, require at least one non-NA value to yield a result, so the all-NaN row *c* gives NaN here.

In [8]:
df.mean(axis='columns')

a    1.400
b    1.300
c      NaN
d   -0.275
dtype: float64
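
One way to see why *mean* needs at least one non-NA value is to rebuild it from *sum* and *count*. This is only an illustrative sketch, not how pandas computes the mean internally.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=list("abcd"),
    columns=["one", "two"],
)

totals = df.sum(axis="columns")    # NaN skipped, so row "c" totals 0.0
counts = df.count(axis="columns")  # number of non-NA values per row; 0 for row "c"
manual_mean = totals / counts      # 0.0 / 0 yields NaN for row "c"

print(manual_mean)
```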

Some methods, like *idxmax* and *idxmin*, return indirect statistics, such as the index label where the maximum or minimum value occurs.

In [9]:
df

,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3


In [10]:
df.idxmax()

one    b
two    d
dtype: object

To locate the maximum and minimum of the whole DataFrame, stack the columns into a single Series first. *idxmax* and *idxmin* then return a (row label, column label) pair:

In [11]:
df.stack().idxmax()

('b', 'one')

In [12]:
df.stack().idxmin()

('b', 'two')
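
The label pair returned this way can be passed straight to *loc* to fetch the value itself. A minimal sketch on the same `df`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=list("abcd"),
    columns=["one", "two"],
)

# Position of the overall maximum as (row label, column label)
row, col = df.stack().idxmax()

# Look the value up again with label-based indexing
largest = df.loc[row, col]

print(row, col, largest)
```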

Other methods, like *cumsum*, are *accumulations*: they return a running total along a given axis. NaN entries stay NaN in the output but do not interrupt the running total.

In [13]:
df

,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3


In [14]:
df.cumsum()

,one,two
a,1.4,
b,8.5,-4.5
c,,
d,9.25,-5.8
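
Like the reductions, *cumsum* accepts *skipna*. The sketch below contrasts the default with `skipna=False` on the same `df`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=list("abcd"),
    columns=["one", "two"],
)

# Default: NaN entries stay NaN, but the running total continues after them
running = df.cumsum()

# skipna=False: once a NaN is hit, every later entry in that column is NaN
strict = df.cumsum(skipna=False)

print(running)
print(strict)
```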


Some methods are neither reductions nor accumulations. *describe* is one example: it produces multiple summary statistics in one shot.

In [15]:
df.describe()

,one,two
count,3.0,2.0
mean,3.083333,-2.9
std,3.493685,2.262742
min,0.75,-4.5
25%,1.075,-3.7
50%,1.4,-2.9
75%,4.25,-2.1
max,7.1,-1.3
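
Each row of the *describe* output corresponds to an ordinary reduction. As a rough cross-check, a few of them can be assembled by hand (the name `manual` is just for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=list("abcd"),
    columns=["one", "two"],
)

summary = df.describe()

# Rebuild a subset of the summary from individual reductions
manual = pd.DataFrame(
    {
        "count": df.count(),      # non-NA values per column
        "mean": df.mean(),
        "std": df.std(),          # sample standard deviation (ddof=1)
        "min": df.min(),
        "50%": df.quantile(0.5),  # median
        "max": df.max(),
    }
).T

print(manual)
```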


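On nonnumeric data, *describe* produces different summary statistics: the number of values, the number of unique values, the most frequent value (*top*), and how often it occurs (*freq*).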

In [16]:
obj = pd.Series(list('aabc' * 4))
obj.describe()

count     16
unique     3
top        a
freq       8
dtype: object
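
The same four numbers can be obtained individually, which may be handy when only one of them is needed. A small sketch:

```python
import pandas as pd

obj = pd.Series(list("aabc" * 4))
counts = obj.value_counts()

n_total = obj.count()     # number of non-NA values
n_unique = obj.nunique()  # number of distinct values
top = counts.idxmax()     # most frequent value
freq = counts.max()       # how often the most frequent value occurs

print(n_total, n_unique, top, freq)
```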

**Correlation and Covariance**

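Some summary statistics, like correlation and covariance, are computed from pairs of arguments. The examples below use DataFrames of stock prices and volumes loaded from local pickle files: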

In [17]:
price = pd.read_pickle('C:/Users/savin/Coding/Data Analysis/Python for Data Analysis Book/Chapter 5/Chapter 5.3/yahoo_price.pkl')
volume = pd.read_pickle('C:/Users/savin/Coding/Data Analysis/Python for Data Analysis Book/Chapter 5/Chapter 5.3/yahoo_volume.pkl')

In [18]:
returns = price.pct_change()
returns.tail()

Date,AAPL,GOOG,IBM,MSFT
2016-10-17,-0.00068,0.001837,0.002072,-0.003483
2016-10-18,-0.000681,0.019616,-0.026168,0.00769
2016-10-19,-0.002979,0.007846,0.003583,-0.002255
2016-10-20,-0.000512,-0.005652,0.001719,-0.004867
2016-10-21,-0.00393,0.003011,-0.012474,0.042096


The *corr* method of Series computes the correlation of the overlapping, non-NA, aligned-by-index values in two Series. Relatedly, *cov* computes the covariance.

In [19]:
returns['MSFT'].corr(returns["IBM"])

0.49976361144151155

In [20]:
returns['MSFT'].cov(returns["IBM"])

8.870655479703546e-05
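
Because the pickle files are not available outside this notebook, the sketch below uses two small made-up Series (`x` and `y` are hypothetical, not the stock data) to show what "overlapping, non-NA, aligned-by-index" means in practice, and that the correlation is the covariance divided by the product of the standard deviations.

```python
import numpy as np
import pandas as pd

# Hypothetical data: different indexes and NaN values in different places
x = pd.Series([0.01, -0.02, 0.03, np.nan, 0.015], index=list("abcde"))
y = pd.Series([0.02, -0.01, np.nan, 0.01, 0.03, 0.0], index=list("abcdef"))

# Keep only the labels where both Series have a value
pairs = pd.concat([x, y], axis=1, keys=["x", "y"]).dropna()

# Pearson correlation from covariance and standard deviations
manual_corr = pairs["x"].cov(pairs["y"]) / (pairs["x"].std() * pairs["y"].std())

print(pairs)
print(manual_corr, x.corr(y))
```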

DataFrame's *corr* and *cov* methods, on the other hand, return a full correlation or covariance matrix as a DataFrame, respectively.

In [21]:
returns.corr()

,AAPL,GOOG,IBM,MSFT
AAPL,1.0,0.407919,0.386817,0.389695
GOOG,0.407919,1.0,0.405099,0.465919
IBM,0.386817,0.405099,1.0,0.499764
MSFT,0.389695,0.465919,0.499764,1.0


In [22]:
returns.cov()

,AAPL,GOOG,IBM,MSFT
AAPL,0.000277,0.000107,7.8e-05,9.5e-05
GOOG,0.000107,0.000251,7.8e-05,0.000108
IBM,7.8e-05,7.8e-05,0.000146,8.9e-05
MSFT,9.5e-05,0.000108,8.9e-05,0.000215


Using DataFrame's *corrwith* method, you can compute pair-wise correlations between a DataFrame's columns or rows and another Series or DataFrame. Passing a Series returns a Series with the correlation computed for each column.

In [23]:
returns.corrwith(returns['IBM'])

AAPL    0.386817
GOOG    0.405099
IBM     1.000000
MSFT    0.499764
dtype: float64

Passing a DataFrame computes the correlations of matching column names. Here, the percent changes are correlated with volume.

In [24]:
returns.corrwith(volume)

AAPL   -0.075565
GOOG   -0.007067
IBM    -0.204849
MSFT   -0.092950
dtype: float64
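
The column matching can be illustrated with synthetic data. In this sketch, `left` and `right` are hypothetical random frames, and `right` deliberately lacks column `C`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
left = pd.DataFrame(rng.normal(size=(50, 3)), columns=["A", "B", "C"])
right = pd.DataFrame(rng.normal(size=(50, 2)), columns=["A", "B"])

# Columns are paired by name; "C" has no partner in `right`
by_name = left.corrwith(right)

# The same numbers, computed one pair of columns at a time
one_by_one = pd.Series({col: left[col].corr(right[col]) for col in ["A", "B"]})

print(by_name)
print(one_by_one)
```

With the default `drop=False`, the unmatched column shows up in the result as NaN rather than being left out.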

**Unique Values, Value Counts, and Membership**

In [25]:
obj = pd.Series(list('cadaabbcc'))
uniques = obj.unique()
uniques

array(['c', 'a', 'd', 'b'], dtype=object)

In [26]:
obj.value_counts()

c    3
a    3
b    2
d    1
Name: count, dtype: int64

By default, *value_counts* sorts the result by count in descending order. This can be turned off with the *sort* keyword.

In [27]:
obj.value_counts(sort=False)

c    3
a    3
d    1
b    2
Name: count, dtype: int64
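
*value_counts* can also report relative frequencies instead of raw counts through its `normalize` parameter, which is not used in the cells above. A short sketch:

```python
import numpy as np
import pandas as pd

obj = pd.Series(list("cadaabbcc"))

# Share of each value instead of its count; the shares add up to 1
shares = obj.value_counts(normalize=True)

print(shares)
```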

*isin* performs a vectorized set membership check and can be used to filter a dataset down to a subset of values in a Series or a column of a DataFrame.

In [28]:
obj

0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object

In [29]:
mask = obj.isin(['b', 'c'])
mask

0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool

In [30]:
obj[mask]

0    c
5    b
6    b
7    c
8    c
dtype: object
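
Since the result of *isin* is a Boolean Series, it can be inverted with `~` to keep everything that is not in the set. A minimal sketch:

```python
import pandas as pd

obj = pd.Series(list("cadaabbcc"))

# True where the value is "b" or "c"
mask = obj.isin(["b", "c"])

# Invert the mask to keep the remaining values
rest = obj[~mask]

print(rest)
```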

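Related to *isin* is the *Index.get_indexer* method, which returns, for each value in an array of possibly non-distinct values, its position in an array of distinct values.

In [31]: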
to_match = pd.Series(list('cabbca'))
unique_vals = pd.Series(list('cba'))

index = pd.Index(unique_vals).get_indexer(to_match)
index

array([0, 2, 1, 1, 0, 2], dtype=int64)
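
*get_indexer* marks values that do not occur in the index with -1, which is worth checking before using the result for positional lookups. A sketch with a made-up value `"z"`:

```python
import pandas as pd

unique_vals = pd.Index(["c", "b", "a"])

# "z" is not in the index, so its position is reported as -1
positions = unique_vals.get_indexer(["c", "a", "z"])

# Boolean array marking which values were found
found = positions >= 0

print(positions, found)
```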

In some cases, you may want to compute a histogram on multiple related columns in a DataFrame.

In [32]:
data = pd.DataFrame({"Qu1" : [1, 3, 4, 3, 4], "Qu2" : [2, 3, 1, 2, 3], "Qu3" : [1, 5, 2, 4, 4]})

data

,Qu1,Qu2,Qu3
0,1,2,1
1,3,3,5
2,4,1,2
3,3,2,4
4,4,3,4


The value counts of a single column can be computed like this: 

In [33]:
data['Qu1'].value_counts().sort_index()

Qu1
1    1
3    2
4    2
Name: count, dtype: int64

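To compute this for all columns, pass *pd.Series.value_counts* to the DataFrame's *apply* method. The row labels of the result are the distinct values occurring in any of the columns, and the entries are the counts of each value in each column. Values that never occur in a column come back as NaN, which *fillna(0)* replaces with 0.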

In [39]:
result = data.apply(pd.Series.value_counts).fillna(0)
result

,Qu1,Qu2,Qu3
1,1.0,1.0,1.0
2,0.0,2.0,1.0
3,2.0,2.0,0.0
4,2.0,0.0,2.0
5,0.0,0.0,1.0
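
The counts come back as floats because NaN forces a float dtype before *fillna* runs. If integer counts are preferred, one option is to cast afterwards, as in this sketch:

```python
import pandas as pd

data = pd.DataFrame(
    {"Qu1": [1, 3, 4, 3, 4], "Qu2": [2, 3, 1, 2, 3], "Qu3": [1, 5, 2, 4, 4]}
)

# Count values per column, fill the gaps with 0, then cast back to integers
counts = data.apply(pd.Series.value_counts).fillna(0).astype(int)

print(counts)
```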


There is also a *DataFrame.value_counts* method, but it treats each row of the DataFrame as a tuple and counts the number of occurrences of each distinct row.

In [40]:
data = pd.DataFrame({'a' : [1, 1, 1, 2, 2], 'b' : [0, 0, 1, 0, 0]})
data

,a,b
0,1,0
1,1,0
2,1,1
3,2,0
4,2,0


In [41]:
data.value_counts()

a  b
1  0    2
2  0    2
1  1    1
Name: count, dtype: int64
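
The same row counts can be reached by grouping on all columns and taking the group sizes. This sketch is an alternative route for comparison, not a description of how *DataFrame.value_counts* is implemented.

```python
import pandas as pd

data = pd.DataFrame({"a": [1, 1, 1, 2, 2], "b": [0, 0, 1, 0, 0]})

# Counts of each distinct row, sorted by count in descending order
by_row = data.value_counts()

# Group on every column and count the rows in each group (sorted by key)
by_group = data.groupby(["a", "b"]).size()

print(by_row)
print(by_group)
```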