# 1. Example: Filling Missing Values with Group-specific Values

When cleaning up missing data, in some cases you will filter out observations using *dropna*, but in others you may want to impute (fill in) the NA values using a fixed value or some value derived from the data. *fillna* is the right tool to use; for example, here I fill in the NA values with the mean plus an arbitrary offset of 1.342:

In [6]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

In [7]:
s = Series(np.random.randn(10))

In [8]:
s[::2] = [np.nan, np.nan, np.nan, np.nan, np.nan]

s

0         NaN
1    0.860078
2         NaN
3    1.303794
4         NaN
5    0.719013
6         NaN
7    0.343548
8         NaN
9    0.159409
dtype: float64

In [9]:
s.fillna(s.mean(0)+1.342)

0    2.019169
1    0.860078
2    2.019169
3    1.303794
4    2.019169
5    0.719013
6    2.019169
7    0.343548
8    2.019169
9    0.159409
dtype: float64
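
As a side-by-side comparison of the options mentioned above, here is a minimal sketch on a small fixed Series (the values are illustrative and not taken from the random data above). It contrasts dropping the missing observations, filling with a constant, and filling with the plain mean of the observed values:

```python
import numpy as np
import pandas as pd

# Small fixed Series so the result is reproducible (illustrative values)
s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan, 5.0])

dropped = s.dropna()               # remove the missing observations
filled_const = s.fillna(0.0)       # impute a fixed value
filled_mean = s.fillna(s.mean())   # impute the mean of the observed values (NaN is skipped)

print(len(dropped))
print(filled_mean.tolist())
```

Note that *mean* ignores NaN by default, so the fill value depends only on the observed entries.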

Suppose you need the fill value to vary by group. As you may guess, you need only group the data and use *apply* with a function that calls *fillna* on each data chunk. Here is some sample data on some US states divided into eastern and western regions:

In [10]:
states = ['Ohio', 'New York', 'Vermont', 'Florida',
          'Oregon', 'Nevada', 'California', 'Idaho']

In [11]:
group_key = ['East'] *4 + ['West'] * 4

In [12]:
data = Series(np.arange(8, dtype=float), index=states)  # float dtype so NaN can be assigned below

In [13]:
data[['Vermont', 'Nevada', 'Idaho']] = np.nan

In [14]:
data

Ohio          0.0
New York      1.0
Vermont       NaN
Florida       3.0
Oregon        4.0
Nevada        NaN
California    6.0
Idaho         NaN
dtype: float64

In [15]:
data.groupby(group_key).sum()

East     4.0
West    10.0
dtype: float64

In [16]:
for i, j in data.groupby(group_key):
    print(i, '\n', j)

East 
 Ohio        0.0
New York    1.0
Vermont     NaN
Florida     3.0
dtype: float64
West 
 Oregon        4.0
Nevada        NaN
California    6.0
Idaho         NaN
dtype: float64


We can fill the NA values using the group means like so:

In [17]:
fill_mean = lambda g: g.fillna(g.mean())

In [18]:
data.groupby(group_key).apply(fill_mean)

Ohio          0.000000
New York      1.000000
Vermont       1.333333
Florida       3.000000
Oregon        4.000000
Nevada        5.000000
California    6.000000
Idaho         5.000000
dtype: float64
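
An alternative to *apply* here, sketched under the assumption that only the group mean is needed, is *transform*. It returns a Series with the same index as the input, holding each row's group mean, which can then be passed straight to *fillna*. The names `group_means` and `filled` are introduced only for this illustration:

```python
import numpy as np
import pandas as pd

states = ['Ohio', 'New York', 'Vermont', 'Florida',
          'Oregon', 'Nevada', 'California', 'Idaho']
group_key = ['East'] * 4 + ['West'] * 4

data = pd.Series(np.arange(8, dtype=float), index=states)
data[['Vermont', 'Nevada', 'Idaho']] = np.nan

# transform('mean') returns a Series aligned with data, holding each row's group mean
group_means = data.groupby(group_key).transform('mean')

# fillna with a Series fills each missing entry from the matching index label
filled = data.fillna(group_means)
print(filled)
```

Because the result of *transform* keeps the original index, the filled Series comes back in the original row order without any group keys attached.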

In another case, you might have predefined fill values in your code that vary by group.

Since the groups have a *name* attribute set internally, we can use that:

In [19]:
fill_values = {'East': .5, 'West': -1}

In [20]:
fill_func = lambda x: x.fillna(fill_values[x.name])

In [21]:
data.groupby(group_key).apply(fill_func)

Ohio          0.0
New York      1.0
Vermont       0.5
Florida       3.0
Oregon        4.0
Nevada       -1.0
California    6.0
Idaho        -1.0
dtype: float64
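
The same result can be reached without *apply* by mapping each row's group label to its fill value. This is only a sketch; `per_row_fill` is a name introduced for the illustration:

```python
import numpy as np
import pandas as pd

states = ['Ohio', 'New York', 'Vermont', 'Florida',
          'Oregon', 'Nevada', 'California', 'Idaho']
group_key = ['East'] * 4 + ['West'] * 4
fill_values = {'East': 0.5, 'West': -1}

data = pd.Series(np.arange(8, dtype=float), index=states)
data[['Vermont', 'Nevada', 'Idaho']] = np.nan

# Map each row's group label to its fill value, indexed like the data
per_row_fill = pd.Series(group_key, index=data.index).map(fill_values)

# Only the missing entries are replaced; observed values are left untouched
filled = data.fillna(per_row_fill)
print(filled)
```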

# 2. Random Sampling and Permutation

Suppose you wanted to draw a random sample (with or without replacement) from a large dataset for Monte Carlo simulation purposes or some other application. There are a number of ways to perform the "draws"; some are much more efficient than others. One way is to select the first K elements of np.random.permutation(N), where N is the size of your complete dataset and K the desired sample size. As a more fun example, here's a way to construct a deck of English-style playing cards:

In [22]:
# Hearts, Spades, Clubs, Diamonds

suits = ['H','S', 'C', 'D']

card_val = list(range(1, 11)) + [10] *3

card_val = card_val * 4

base_names = ['A'] + list(range(2, 11)) + ['J', 'K','Q']

In [23]:
cards = []

for suit in suits:  # reuse the list defined above instead of overwriting it
    cards.extend(str(num) + suit for num in base_names)

deck = Series(card_val, index = cards)

So now we have a Series of length 52 whose index contains card names and values are the ones used in blackjack and other games (to keep things simple, I just let the ace be 1):

In [24]:
deck[:13]

AH      1
2H      2
3H      3
4H      4
5H      5
6H      6
7H      7
8H      8
9H      9
10H    10
JH     10
KH     10
QH     10
dtype: int64

Now, based on what I said above, drawing a hand of 4 cards from the deck could be written as:

In [25]:
def draw(deck, n=4):
    return deck.take(np.random.permutation(len(deck))[:n])

In [26]:
draw(deck)

4S    4
5S    5
3C    3
9S    9
dtype: int64
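
The permutation approach draws without replacement. The sketch below, using a small stand-in Series rather than the deck, shows that approach next to a with-replacement draw and to the built-in *sample* method of pandas. The seeded generator and the `items` data are assumptions made for repeatability:

```python
import numpy as np
import pandas as pd

# A small stand-in for the deck: 10 labelled items (hypothetical data)
items = pd.Series(np.arange(10), index=[f'item{i}' for i in range(10)])

rng = np.random.default_rng(0)   # seeded generator so runs are repeatable
k = 4

# Without replacement: first k entries of a random permutation of the positions
without = items.take(rng.permutation(len(items))[:k])

# With replacement: draw k positions independently, so repeats are possible
with_repl = items.take(rng.integers(0, len(items), size=k))

# Built-in sampling without replacement (the default for sample)
builtin = items.sample(n=k, random_state=0)

print(without)
print(with_repl)
print(builtin)
```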

Suppose you wanted two random cards from each suit. Because the suit is the last character of each card name, we can group based on this and use apply:

In [27]:
get_suit = lambda card: card[-1] #last letter is suit

In [28]:
deck.groupby(get_suit).apply(draw, n = 2)

C  2C      2
   6C      6
D  KD     10
   4D      4
H  7H      7
   10H    10
S  3S      3
   QS     10
dtype: int64

In [29]:
# alternatively

deck.groupby(get_suit, group_keys= False).apply(draw, n = 2)

AC      1
QC     10
10D    10
AD      1
10H    10
QH     10
QS     10
8S      8
dtype: int64
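
Recent pandas versions (1.1 and later, as far as I know) also provide a *sample* method on groupby objects, which gives a per-group draw without a helper function. A minimal sketch that rebuilds the deck so it runs on its own:

```python
import pandas as pd

suits = ['H', 'S', 'C', 'D']
base_names = ['A'] + list(range(2, 11)) + ['J', 'K', 'Q']
card_val = (list(range(1, 11)) + [10] * 3) * 4
cards = [str(name) + suit for suit in suits for name in base_names]
deck = pd.Series(card_val, index=cards)

# Group on the last character of each card name and sample two cards per suit
two_per_suit = deck.groupby(lambda card: card[-1]).sample(n=2, random_state=0)
print(two_per_suit)
```

The result keeps the flat card-name index, similar to the `group_keys=False` variant above.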

# 3. Example: Group Weighted Average and Correlation

Under the split-apply-combine paradigm of *groupby*, operations between columns in a DataFrame or between two Series, such as a group weighted average, become a routine affair. As an example, take this dataset containing group keys, values, and some weights:

In [30]:
df = DataFrame({'category': ['a', 'a','a', 'a','b','b','b','b'],
                'data': np.random.randn(8),
                'weights': np.arange(8)})

In [31]:
df

  category      data  weights
0        a  0.527354        0
1        a -1.613679        1
2        a  0.671570        2
3        a  0.084891        3
4        b -1.991321        4
5        b  0.458878        5
6        b -0.010845        6
7        b -1.237221        7


The group weighted average by *category* would then be:

In [32]:
grouped = df.groupby('category')

In [33]:
get_wavg = lambda g: np.average(g['data'], weights = g['weights'])

In [34]:
grouped.apply(get_wavg)

category
a   -0.002644
b   -0.654387
dtype: float64
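
For reference, the weighted average of values x with weights w is sum(w * x) / sum(w). The sketch below recomputes the two group results from that definition without *apply*, using the same values as the DataFrame shown above. The names `weighted_sum`, `weight_total`, and `wavg` are introduced only for the illustration:

```python
import numpy as np
import pandas as pd

# Same values as the DataFrame shown above
df = pd.DataFrame({'category': list('aaaabbbb'),
                   'data': [0.527354, -1.613679, 0.671570, 0.084891,
                            -1.991321, 0.458878, -0.010845, -1.237221],
                   'weights': np.arange(8)})

# Weighted average = sum(w * x) / sum(w), computed per group
weighted_sum = (df['data'] * df['weights']).groupby(df['category']).sum()
weight_total = df['weights'].groupby(df['category']).sum()
wavg = weighted_sum / weight_total
print(wavg)
```

This agrees with the *np.average* result to the displayed precision, and it avoids calling a Python function once per group.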

As a less trivial example, consider a data set from Yahoo! Finance containing end-of-day prices for a few stocks and the S&P 500 index (the SPX ticker):

In [35]:
close_px = pd.read_csv('../../CSV Files/O_Reilly/ch09/stock_px.csv',
            parse_dates= True, index_col= 0)

In [36]:
close_px.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2214 entries, 2003-01-02 to 2011-10-14
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   AAPL    2214 non-null   float64
 1   MSFT    2214 non-null   float64
 2   XOM     2214 non-null   float64
 3   SPX     2214 non-null   float64
dtypes: float64(4)
memory usage: 86.5 KB


One task of interest might be to compute a DataFrame consisting of the yearly correlations of daily returns (computed from percent changes) with SPX. Here is one way to do it:

In [37]:
rets = close_px.pct_change().dropna()

rets

                AAPL      MSFT       XOM       SPX
2003-01-03  0.006757  0.001421  0.000684 -0.000484
2003-01-06  0.000000  0.017975  0.024624  0.022474
2003-01-07 -0.002685  0.019052 -0.033712 -0.006545
2003-01-08 -0.020188 -0.028272 -0.004145 -0.014086
2003-01-09  0.008242  0.029094  0.021159  0.019386
...              ...       ...       ...       ...
2011-10-10  0.051406  0.026286  0.036977  0.034125
2011-10-11  0.029526  0.002227 -0.000131  0.000544
2011-10-12  0.004747 -0.001481  0.011669  0.009795
2011-10-13  0.015515  0.008160 -0.010238 -0.002974
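
To make the return calculation concrete: *pct_change* computes each value divided by the previous one, minus 1, so the first row has no predecessor and comes out as NaN (which is why *dropna* follows it). A small sketch with hypothetical prices:

```python
import pandas as pd

# Hypothetical closing prices for three consecutive days
px = pd.Series([100.0, 110.0, 99.0])

manual = px / px.shift(1) - 1   # (today - yesterday) / yesterday
builtin = px.pct_change()

print(builtin)
print(builtin.dropna())         # the leading NaN row is removed
```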


In [38]:
spx_corr = lambda x: x.corrwith(x['SPX'])

spx_corr

<function __main__.<lambda>(x)>

In [39]:
by_year = rets.groupby(lambda x: x.year)

by_year

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002BBBB9AD6C0>

In [40]:
by_year.apply(spx_corr)

          AAPL      MSFT       XOM  SPX
2003  0.541124  0.745174  0.661265  1.0
2004  0.374283  0.588531  0.557742  1.0
2005  0.467540  0.562374  0.631010  1.0
2006  0.428267  0.406126  0.518514  1.0
2007  0.508118  0.658770  0.786264  1.0
2008  0.681434  0.804626  0.828303  1.0
2009  0.707103  0.654902  0.797921  1.0
2010  0.710105  0.730118  0.839057  1.0
2011  0.691931  0.800996  0.859975  1.0
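
Since the stock price file may not be at hand, here is a self-contained sketch of the same pattern on synthetic returns (random numbers, not real market data). The column name `STOCK` and the two-year date range are assumptions made for the illustration; the year is taken from the DatetimeIndex directly instead of through a lambda:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.date_range('2020-01-01', '2021-12-31', freq='B')  # business days over two years

# Synthetic returns: a market factor plus stock-specific noise (not real data)
market = rng.normal(0, 0.01, len(dates))
rets = pd.DataFrame({'STOCK': market + rng.normal(0, 0.01, len(dates)),
                     'SPX': market}, index=dates)

# Group rows by calendar year, then correlate every column with SPX within each year
yearly = rets.groupby(rets.index.year).apply(lambda g: g.corrwith(g['SPX']))
print(yearly)
```

As in the real output, the SPX column is 1.0 in every year because each series is perfectly correlated with itself.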


There is, of course, nothing to stop you from computing inter-column correlations:

In [41]:
# Annual correlation of Apple with Microsoft

by_year.apply(lambda i: i['AAPL'].corr(i['MSFT']))

2003    0.480868
2004    0.259024
2005    0.300093
2006    0.161735
2007    0.417738
2008    0.611901
2009    0.432738
2010    0.571946
2011    0.581987
dtype: float64

# 4. Example: Group-Wise Linear Regression

In the same vein as the previous example, you can use *groupby* to perform more complex group-wise statistical analysis, as long as the function returns a pandas object or scalar value. For example, I can define the following *regress* function (using the *statsmodels* econometrics library), which executes an ordinary least squares (OLS) regression on each chunk of data:

In [5]:
import statsmodels.api as sm

def regress(data, yvar, xvars):
    Y = data[yvar]
    X = data[xvars].copy()         # copy so the group itself is not modified
    X['intercept'] = 1.            # constant column for the intercept term
    result = sm.OLS(Y, X).fit()    # OLS takes the response first, then the regressors
    return result.params

Now, to run a yearly linear regression of AAPL on SPX returns, I execute:

In [42]:
by_year.apply(regress, 'AAPL', ['SPX'])

The result is a DataFrame indexed by year with one column per entry of *result.params*: the SPX slope and the intercept.
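
As a cross-check that does not need *statsmodels*, the coefficients of a regression of `yvar` on `xvars` plus an intercept can be sketched with NumPy's least-squares solver. This is a swapped-in technique, not the statsmodels call; the function name `regress_np` and the toy data are hypothetical. The toy data follows y = 2 * x + 0.5 exactly, so every group should recover a slope of 2 and an intercept of 0.5:

```python
import numpy as np
import pandas as pd

def regress_np(data, yvar, xvars):
    """OLS coefficients via least squares; a NumPy stand-in for the statsmodels version."""
    X = data[xvars].to_numpy(dtype=float)
    X = np.column_stack([X, np.ones(len(X))])      # append the intercept column
    y = data[yvar].to_numpy(dtype=float)
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)  # minimizes ||X @ coefs - y||
    return pd.Series(coefs, index=list(xvars) + ['intercept'])

# Synthetic check: y = 2 * x + 0.5 exactly, in two groups (hypothetical data)
x = np.arange(10, dtype=float)
toy = pd.DataFrame({'x': x, 'y': 2 * x + 0.5, 'grp': ['a'] * 5 + ['b'] * 5})

# Extra positional arguments to apply are passed through to the function
params = toy.groupby('grp')[['x', 'y']].apply(regress_np, 'y', ['x'])
print(params)
```

Because the function returns a Series labelled by regressor name, *apply* assembles one row per group with a column for each coefficient.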
