# Advanced Groupby Indexing

When you want to inspect individual pieces of a groupby object (usually for debugging), it helps to step through the groups one at a time or to pull out a specific group. There are two ways to do this.

In [1]:
import pandas as pd
#Create data + groupby object
df = pd.DataFrame([["Type{}".format(x%5+1), x, x**2] for x in range(200)],
                 columns=["Type", "Value", "Value Squared"])
groupby_object = df.groupby("Type")

First, you can simply convert the groupby object to a list. Each element of the list is a (key, dataframe) pair.

In [2]:
#Get the groupby object as a list
groupby_object_list = list(groupby_object)
print("Key: {}".format(groupby_object_list[0][0]))
print("Values (First 3)")
print(groupby_object_list[0][1].head(3))

Key: Type1
Values (First 3)
     Type  Value  Value Squared
0   Type1      0              0
5   Type1      5             25
10  Type1     10            100


You can also retrieve a single group by its key with get_group.

In [3]:
#Get one of the groupby pieces
print(groupby_object.get_group("Type1").head(3))

     Type  Value  Value Squared
0   Type1      0              0
5   Type1      5             25
10  Type1     10            100
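
A third option, sketched here as an addition to the two approaches above, is to iterate over the groupby object directly, which yields the same (key, dataframe) pairs without building the whole list first. The `group_sizes` dictionary is only an illustrative name; the `groups` attribute maps each key to the index labels of its rows.

```python
import pandas as pd

# Same data as in the cells above
df = pd.DataFrame([["Type{}".format(x % 5 + 1), x, x**2] for x in range(200)],
                  columns=["Type", "Value", "Value Squared"])
groupby_object = df.groupby("Type")

# Iterating yields (key, sub-dataframe) pairs, one group at a time
group_sizes = {}
for key, group in groupby_object:
    group_sizes[key] = len(group)
print(group_sizes)

# .groups maps each key to the index labels of the rows in that group
first_labels = list(groupby_object.groups["Type1"][:3])
print(first_labels)
```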


# Using Where On DataFrames

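Given a dataframe, you can call where with a boolean condition. Every row where the condition is False has its values replaced with NaN, while the shape of the dataframe stays the same. Note that the integer columns are converted to float so that they can hold NaN.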

In [4]:
import pandas as pd
#Create data
df = pd.DataFrame([["Type{}".format(x%5+1), x, x**2] for x in range(200)],
                 columns=["Type", "Value", "Value Squared"])

#Use where
print(df.where((df["Type"] == "Type1") | (df["Type"] == "Type3")))

      Type  Value  Value Squared
0    Type1    0.0            0.0
1      NaN    NaN            NaN
2    Type3    2.0            4.0
3      NaN    NaN            NaN
4      NaN    NaN            NaN
..     ...    ...            ...
195  Type1  195.0        38025.0
196    NaN    NaN            NaN
197  Type3  197.0        38809.0
198    NaN    NaN            NaN
199    NaN    NaN            NaN

[200 rows x 3 columns]


In [5]:
#Pick just values < 3
print(df.where(df["Value"] < 3))

      Type  Value  Value Squared
0    Type1    0.0            0.0
1    Type2    1.0            1.0
2    Type3    2.0            4.0
3      NaN    NaN            NaN
4      NaN    NaN            NaN
..     ...    ...            ...
195    NaN    NaN            NaN
196    NaN    NaN            NaN
197    NaN    NaN            NaN
198    NaN    NaN            NaN
199    NaN    NaN            NaN

[200 rows x 3 columns]
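
If you only want the matching rows rather than a same-shaped frame full of NaN, two common alternatives are chaining dropna after where or using plain boolean indexing. The sketch below assumes the same data as above; `condition`, `masked`, and `filtered` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame([["Type{}".format(x % 5 + 1), x, x**2] for x in range(200)],
                  columns=["Type", "Value", "Value Squared"])

condition = df["Value"] < 3

# where keeps the original shape; dropna then removes the rows that became NaN
masked = df.where(condition).dropna()

# Boolean indexing selects the matching rows directly and keeps the integer dtypes
filtered = df[condition]

print(masked)
print(filtered)
```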


# Align Two DataFrames

With align, you can make two dataframes share the same rows, columns, or both. First, create some dummy data.

In [6]:
import pandas as pd
df1 = pd.DataFrame([[1, 2],
                   [3, 4]], columns=["A", "B"],
                  index=[1, 2])

df2 = pd.DataFrame([[1, 2],
                   [3, 4]], columns=["A", "C"],
                  index=[1, 3])
print(df1)
print()
print(df2)

   A  B
1  1  2
2  3  4

   A  C
1  1  2
3  3  4


By default, align performs an outer join on both rows and columns and fills the gaps with NaN, so every row and column found in either dataframe will be present in both results.

In [7]:
temp1, temp2 = df1.align(df2)
print(temp1)
print()
print(temp2)

     A    B   C
1  1.0  2.0 NaN
2  3.0  4.0 NaN
3  NaN  NaN NaN

     A   B    C
1  1.0 NaN  2.0
2  NaN NaN  NaN
3  3.0 NaN  4.0


The join argument can be set to inner to get an inner join instead (left and right joins are also available). Now we get only column A and row 1, because they are the only ones present in both dataframes.

In [8]:
temp1, temp2 = df1.align(df2, join='inner')
print(temp1)
print()
print(temp2)

   A
1  1

   A
1  1


By using axis, you can align only the columns or only the rows: axis=1 aligns the columns, whereas axis=0 aligns the rows.

In [9]:
temp1, temp2 = df1.align(df2, join='inner', axis=1)
print(temp1)
print()
print(temp2)

   A
1  1
2  3

   A
1  1
3  3
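
For completeness, here is a small sketch of the axis=0 case using the same dummy data. Only the row index is aligned, so each dataframe keeps its own columns; `rows1` and `rows2` are illustrative names.

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2],
                    [3, 4]], columns=["A", "B"],
                   index=[1, 2])
df2 = pd.DataFrame([[1, 2],
                    [3, 4]], columns=["A", "C"],
                   index=[1, 3])

# axis=0 aligns the rows only; the columns of each dataframe are left alone
rows1, rows2 = df1.align(df2, join="inner", axis=0)
print(rows1)
print()
print(rows2)
```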


# Switch Types

By using astype, you can convert a pandas series to another type. Below, the series holds numbers stored as strings, so sum concatenates the strings instead of adding them. After converting to int, we get the actual sum.

In [10]:
import pandas as pd
#Series of strings
df = pd.Series(["1", "2", "3", "4"])

#Sum will be a string
print(df.sum())

1234


In [11]:
#Convert to integer type
df = df.astype(int)

#The sum is a number now
print(df.sum())

10
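
astype(int) raises a ValueError if any string cannot be parsed as an integer. When the data may contain such entries, one option is pd.to_numeric with errors="coerce", sketched below on a hypothetical series (`raw` and its contents are made up for illustration).

```python
import pandas as pd

# Hypothetical series with one entry that is not a number
raw = pd.Series(["1", "2", "three", "4"])

# errors="coerce" turns unparseable entries into NaN instead of raising
numbers = pd.to_numeric(raw, errors="coerce")
print(numbers)

# sum skips NaN by default
total = numbers.sum()
print(total)
```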


# Getting the n Smallest Elements of a Series

Given a series, you can get the n smallest values by calling nsmallest and passing n, the number of values you want.

In [12]:
import pandas as pd
# Create data
df = (pd.Series(list(range(101))) - 50)**2
print(df)

0      2500
1      2401
2      2304
3      2209
4      2116
       ... 
96     2116
97     2209
98     2304
99     2401
100    2500
Length: 101, dtype: int64


In [13]:
#Get n smallest
print(df.nsmallest(n=5))

50    0
49    1
51    1
48    4
52    4
dtype: int64
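
When there are ties, nsmallest also accepts a keep argument. As a brief sketch on the same data (named `squares` here for illustration), keep="all" returns every value tied with the last one selected, so the result can contain more than n entries.

```python
import pandas as pd

squares = (pd.Series(list(range(101))) - 50) ** 2

# keep="all" keeps every value tied with the last selected one,
# so asking for 2 values can return more than 2 rows
smallest_with_ties = squares.nsmallest(n=2, keep="all")
print(smallest_with_ties)
```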


# Getting the n Smallest Elements of a DataFrame

You can get the n smallest rows of a dataframe by calling nsmallest and giving a list of columns. The first column is used to find the smallest values, whereas each subsequent column breaks any ties left by the columns before it.

In [14]:
import pandas as pd
#Create the data
df = pd.concat([(pd.Series(list(range(101))) - 50)**2,
                pd.Series(list(range((101))))], axis=1)
df.columns = ["Value1", "Value2"]
print(df)

     Value1  Value2
0      2500       0
1      2401       1
2      2304       2
3      2209       3
4      2116       4
..      ...     ...
96     2116      96
97     2209      97
98     2304      98
99     2401      99
100    2500     100

[101 rows x 2 columns]


In [15]:
#Find the n smallest by value 1 then 2, notice how the values come out
df.nsmallest(n=3, columns=["Value1", "Value2"])

    Value1  Value2
50       0      50
49       1      49
51       1      51


In [16]:
#Find the smallest values by value 2 then 1, notice the values are different now
df.nsmallest(n=3, columns=[ "Value2", "Value1"])

   Value1  Value2
0    2500       0
1    2401       1
2    2304       2
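
One way to reason about the column order is to compare nsmallest with sorting by the same columns and taking the first n rows. The sketch below uses the same data; `via_nsmallest` and `via_sort` are illustrative names.

```python
import pandas as pd

df = pd.concat([(pd.Series(list(range(101))) - 50) ** 2,
                pd.Series(list(range(101)))], axis=1)
df.columns = ["Value1", "Value2"]

# nsmallest on a list of columns...
via_nsmallest = df.nsmallest(n=3, columns=["Value1", "Value2"])

# ...selects the same rows as sorting by those columns and taking the head
via_sort = df.sort_values(["Value1", "Value2"]).head(3)

print(list(via_nsmallest.index))
print(list(via_sort.index))
```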


# Expanding Windows

Pandas can create expanding windows, which compute a function over progressively larger slices of the data, always starting from the first element. Below, we use them to find the expanding sum and the expanding mean of a series.

In [17]:
import pandas as pd

# Create data
data = pd.Series(list(range(101)))
print(data)

0        0
1        1
2        2
3        3
4        4
      ... 
96      96
97      97
98      98
99      99
100    100
Length: 101, dtype: int64


In [18]:
# Find the expanding sum
expanding_sum = data.expanding().sum()
print(expanding_sum)

0         0.0
1         1.0
2         3.0
3         6.0
4        10.0
        ...  
96     4656.0
97     4753.0
98     4851.0
99     4950.0
100    5050.0
Length: 101, dtype: float64


In [19]:
# Find the expanding mean
expanding_mean = data.expanding().mean()
print(expanding_mean)

0       0.0
1       0.5
2       1.0
3       1.5
4       2.0
       ... 
96     48.0
97     48.5
98     49.0
99     49.5
100    50.0
Length: 101, dtype: float64
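
As a quick sanity check, the expanding sum should match the cumulative sum (apart from the float dtype). The sketch below also shows the min_periods argument, which leaves the result as NaN until the window holds at least that many values; `same_as_cumsum` and `late_start` are illustrative names.

```python
import pandas as pd

data = pd.Series(list(range(101)))

# The expanding sum agrees with the cumulative sum, value for value
same_as_cumsum = bool((data.expanding().sum() == data.cumsum()).all())
print(same_as_cumsum)

# min_periods=3 leaves the first two results as NaN
late_start = data.expanding(min_periods=3).mean()
print(late_start.head(4))
```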


## Using Custom Functions

You can also use a custom function with expanding by calling apply. Below, we define a lambda function that counts the multiples of 5 in each window: it tests whether each number is evenly divisible by 5, then sums the resulting boolean values (remember that True counts as 1).

In [20]:
# Apply a custom function
expanding_multiples_of_five = data.expanding().apply(lambda x: ((x % 5) == 0).sum())
print(expanding_multiples_of_five)

0       1.0
1       1.0
2       1.0
3       1.0
4       1.0
       ... 
96     20.0
97     20.0
98     20.0
99     20.0
100    21.0
Length: 101, dtype: float64
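
apply also takes a raw argument. As a minimal sketch, passing raw=True hands each window to the function as a NumPy array rather than a series, which is often faster for simple numeric functions; the result here should match the cell above. The name `counts` is illustrative.

```python
import pandas as pd

data = pd.Series(list(range(101)))

# raw=True passes each window to the function as a NumPy array
counts = data.expanding().apply(lambda x: ((x % 5) == 0).sum(), raw=True)
print(counts.tail(3))
```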


# Find Memory Usage

The function memory_usage will return how much memory each column of a dataframe takes up in bytes.

In [21]:
import pandas as pd

# Create data
data = pd.DataFrame([list(range(100000))]).T
data.columns = ["Value1"]
data["Value2"] = data["Value1"] * 10
data["Value3"] = data["Value1"] * 5

# Find memory usage
print(data.memory_usage())

Index        128
Value1    800000
Value2    800000
Value3    800000
dtype: int64


You can divide this by the number of bytes in a gigabyte (here 1024**3 = 1073741824) to express the usage in gigabytes.

In [22]:
bytes_in_gb = 1073741824

#Find memory in GB
print(data.memory_usage() / bytes_in_gb)

Index     1.192093e-07
Value1    7.450581e-04
Value2    7.450581e-04
Value3    7.450581e-04
dtype: float64
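
For columns that hold Python objects such as strings, the default result may count only the references and not the objects themselves. The sketch below, on a hypothetical dataframe (`frame` and its columns are made up for illustration), compares the default with deep=True, which also inspects the stored objects. The exact byte counts depend on the pandas version and the string dtype in use, so none are shown here.

```python
import pandas as pd

# Hypothetical dataframe with one numeric and one string column
frame = pd.DataFrame({"Number": range(1000),
                      "Text": ["row{}".format(i) for i in range(1000)]})

# Default: object columns may be counted by their references only
shallow = frame.memory_usage()

# deep=True also measures the objects stored in the column
deep = frame.memory_usage(deep=True)

print(shallow)
print(deep)
```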
