## CMPINF 2100 | Summarize DataFrames for Exploratory Data Analysis (EDA)

### Import Modules

In [1]:
import numpy as np
import pandas as pd

### Read Data

Let's read in the joined data set we created previously.

In [2]:
df = pd.read_csv('joined_data.csv')

In [3]:
df

Unnamed: 0,A,B,C,D,E,F,G,H
0,a,0.0,-100.0,Jan,aa,10,100.0,AAA
1,b,1.0,-200.0,Feb,aa,20,100.0,BBB
2,c,2.0,-300.0,Mar,aa,10,100.0,AAA
3,d,3.0,-400.0,Apr,bb,20,200.0,BBB
4,e,4.0,-500.0,May,bb,10,200.0,AAA
5,f,5.0,-600.0,Jun,bb,20,200.0,BBB
6,g,6.0,-700.0,Jul,cc,10,,AAA
7,h,7.0,-800.0,Aug,cc,20,,BBB
8,i,8.0,-900.0,Sep,cc,10,,AAA
9,j,9.0,-1000.0,Oct,dd,20,400.0,BBB


### Summarize Columns

As we saw in the previous recording, a DataFrame column is summarized by applying the same methods we use to summarize a Pandas Series!

In [4]:
df.B.mean()

5.5

In [5]:
df['B'].mean()

5.5

In [6]:
df.F.mean()

17.857142857142858

But how does Pandas deal with MISSING values?

By default, the missing values are *dropped*, *skipped*, or *removed* before the summary function is applied. The `skipna` argument controls this behavior.

In [7]:
df.B.mean(skipna=True)

5.5

In [8]:
df.B.mean(skipna=False)

nan

With `skipna=False`, missing values prevent the summary statistic from being calculated, so the result is `nan`!

In [9]:
df.F.mean(skipna=False)

17.857142857142858
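The effect of `skipna` can be reproduced on a small example. This is a minimal sketch on hypothetical toy Series (`s` and `full` are made-up names and values, not the lecture data): one Series has a missing entry and the other does not.

```python
import numpy as np
import pandas as pd

# Hypothetical toy Series (not the lecture data): one value is missing.
s = pd.Series([1.0, 2.0, np.nan, 3.0])

mean_skip = s.mean()              # skipna=True by default: the NaN is ignored
mean_keep = s.mean(skipna=False)  # the NaN propagates into the result

print(mean_skip)  # 2.0
print(mean_keep)  # nan

# A Series without missing values gives the same mean either way.
full = pd.Series([10, 20, 30])
print(full.mean(skipna=False))  # 20.0
```

So `skipna=False` only changes the answer when the Series actually contains a missing value.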

Column `F` has no missing values, so `skipna=False` does not change its mean. Many other summary methods also skip missing values by default.

In [10]:
df.B.std()

3.605551275463989

In [11]:
df.B.var()

13.0

In [12]:
df.B.min()

0.0

In [13]:
df.B.max()

11.0

What happens if we apply the `.mean()` method to the entire dataframe?

In [14]:
df.mean()

TypeError: can only concatenate str (not "int") to str

In [15]:
df.mean(numeric_only=True)

B      5.500000
C   -650.000000
F     17.857143
G    233.333333
dtype: float64

In [16]:
df.C.mean()

-650.0

We can also calculate the standard error of the mean (SEM) for all numeric columns!

In [17]:
df.sem(numeric_only=True)

B      1.040833
C    104.083300
F      2.385527
G     44.095855
dtype: float64

The SEM comes from a simple formula: the standard deviation divided by the square root of the sample size!

In [18]:
df.F

0     10
1     20
2     10
3     20
4     10
5     20
6     10
7     20
8     10
9     20
10    10
11    20
12    30
13    40
Name: F, dtype: int64

In [19]:
df.F.std()

8.92582375303981

In [20]:
df.F.size

14

In [21]:
np.sqrt(df.F.size)

3.7416573867739413

The SEM for `F` is:

In [22]:
df.F.std()/np.sqrt(df.F.size)

2.385526741328836

In [23]:
df.F.sem()

2.385526741328836

The `.sem()` method also **drops** missing values by default. The SEM CANNOT be calculated if missing values are considered!

In [24]:
df.C

0     -100.0
1     -200.0
2     -300.0
3     -400.0
4     -500.0
5     -600.0
6     -700.0
7     -800.0
8     -900.0
9    -1000.0
10   -1100.0
11   -1200.0
12       NaN
13       NaN
Name: C, dtype: float64

In [25]:
df.C.size

14

The `.size` attribute counts the number of elements in the Series, including missing entries, while the `.count()` method returns the number of non-missing entries.

Count: method

Size: attribute

In [26]:
df.C.count()

12

In [27]:
df.C.sem()

104.08329997330664

In [28]:
df.C.sem(skipna=False)

nan

In [29]:
df.C.std()/np.sqrt(df.C.count())

104.08329997330664
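The difference between `.size` and `.count()` matters for the manual SEM whenever a column has missing values. Here is a minimal sketch on a hypothetical toy Series (`x` and its values are made up for illustration), showing that dividing by the square root of `.count()` matches `.sem()`, while dividing by the square root of `.size` does not.

```python
import numpy as np
import pandas as pd

# Hypothetical toy Series with one missing entry (not the lecture data).
x = pd.Series([2.0, 4.0, 6.0, np.nan])

n_all = x.size     # counts every element, missing or not
n_obs = x.count()  # counts only the non-missing entries

# .std() skips the missing value, so the matching sample size is n_obs.
sem_manual = x.std() / np.sqrt(n_obs)

# Using .size divides by too large a sample size and understates the SEM.
sem_wrong = x.std() / np.sqrt(n_all)

print(n_all, n_obs)
print(sem_manual, x.sem())
print(sem_wrong)
```

When a column has no missing values, as with `F`, the two sample sizes agree and either one works.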

All of the previous methods have been used to summarize columns.

However, as in `NumPy`, the methods have an `axis` argument, so we can apply them to individual rows and summarize across columns rather than down them!

For example, we can try to calculate the mean of each row:

In [30]:
df.mean(axis=1)

TypeError: can only concatenate str (not "float") to str

The call fails because the DataFrame allows columns to be different data types, so a single row can mix strings and numbers!

In [31]:
df.dtypes

A     object
B    float64
C    float64
D     object
E     object
F      int64
G    float64
H     object
dtype: object
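One way to summarize across columns despite the mixed data types is to restrict the calculation to the numeric columns. This is a minimal sketch on a hypothetical toy DataFrame (`toy` and its column names are made up, not the lecture data). It reuses the `numeric_only` argument from above for the row means and uses `.count(axis=1)` for the number of non-missing entries in each row.

```python
import numpy as np
import pandas as pd

# Hypothetical toy DataFrame with one string column and two numeric columns.
toy = pd.DataFrame({
    'name': ['a', 'b', 'c'],
    'x': [1.0, 2.0, np.nan],
    'y': [3, 4, 5],
})

# Row-wise mean over the numeric columns only; missing values are skipped.
row_means = toy.mean(axis=1, numeric_only=True)

# Number of non-missing entries in each row, across ALL columns.
row_counts = toy.count(axis=1)

print(row_means.tolist())   # [2.0, 3.0, 5.0]
print(row_counts.tolist())  # [3, 3, 2]
```

The last row has a missing `x`, so its mean is based on `y` alone and its count is 2 rather than 3.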

In [32]:
df

Unnamed: 0,A,B,C,D,E,F,G,H
0,a,0.0,-100.0,Jan,aa,10,100.0,AAA
1,b,1.0,-200.0,Feb,aa,20,100.0,BBB
2,c,2.0,-300.0,Mar,aa,10,100.0,AAA
3,d,3.0,-400.0,Apr,bb,20,200.0,BBB
4,e,4.0,-500.0,May,bb,10,200.0,AAA
5,f,5.0,-600.0,Jun,bb,20,200.0,BBB
6,g,6.0,-700.0,Jul,cc,10,,AAA
7,h,7.0,-800.0,Aug,cc,20,,BBB
8,i,8.0,-900.0,Sep,cc,10,,AAA
9,j,9.0,-1000.0,Oct,dd,20,400.0,BBB


### Custom Summary Functions

We can define our own functions and apply them to the DataFrame.

To demonstrate, let's define our own Average.

In [39]:
def my_avg(x):
    """ assume x is a Pandas Series and assume x is a numeric data type """
    return x.sum() / x.count()

In [40]:
df.mean(numeric_only=True)

B      5.500000
C   -650.000000
F     17.857143
G    233.333333
dtype: float64

In [41]:
my_avg(df.F)

17.857142857142858

In [42]:
my_avg(df.C)

-650.0

In [43]:
my_avg(df.B)

5.5

In [44]:
my_avg(df.G)

233.33333333333334

But to apply a custom function to the entire DataFrame, we need to use the `.apply()` method, which by default calls the function on each column in turn!

In [45]:
df.apply(my_avg)

TypeError: can only concatenate str (not "int") to str

We instead should only apply `my_avg()` to NUMERIC columns!

A simple way to do that is to SELECT all NUMERIC columns in the DataFrame.

Pandas has a helper method `.select_dtypes()` that allows you to easily select all columns of a particular data type.

In [46]:
df.select_dtypes('number').apply(my_avg)

B      5.500000
C   -650.000000
F     17.857143
G    233.333333
dtype: float64

In [47]:
df.mean(numeric_only=True)

B      5.500000
C   -650.000000
F     17.857143
G    233.333333
dtype: float64
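The same pattern works for any function that accepts a Series and returns a single value. As a minimal sketch, here is a hypothetical `my_range()` function (not part of the lecture) applied to the numeric columns of a made-up toy DataFrame.

```python
import numpy as np
import pandas as pd

def my_range(x):
    """Assume x is a numeric Pandas Series; return max minus min."""
    return x.max() - x.min()

# Hypothetical toy DataFrame (not the lecture data).
toy = pd.DataFrame({
    'label': ['a', 'b', 'c'],
    'x': [1.0, 5.0, np.nan],
    'y': [10, 20, 40],
})

# Select the numeric columns first, then apply the function column by column.
ranges = toy.select_dtypes('number').apply(my_range)

print(ranges)
```

Because `.max()` and `.min()` skip missing values by default, the missing entry in `x` does not turn the result into `nan`.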

### Methods for Missing Values

There are specialized and pre-defined methods dedicated to identifying missing entries in a DataFrame.

In [48]:
df

Unnamed: 0,A,B,C,D,E,F,G,H
0,a,0.0,-100.0,Jan,aa,10,100.0,AAA
1,b,1.0,-200.0,Feb,aa,20,100.0,BBB
2,c,2.0,-300.0,Mar,aa,10,100.0,AAA
3,d,3.0,-400.0,Apr,bb,20,200.0,BBB
4,e,4.0,-500.0,May,bb,10,200.0,AAA
5,f,5.0,-600.0,Jun,bb,20,200.0,BBB
6,g,6.0,-700.0,Jul,cc,10,,AAA
7,h,7.0,-800.0,Aug,cc,20,,BBB
8,i,8.0,-900.0,Sep,cc,10,,AAA
9,j,9.0,-1000.0,Oct,dd,20,400.0,BBB


The `.isnull()` method converts a Series into a **boolean** Series! A `True` value means the entry is missing, while a `False` means a value is present.

In [49]:
df.B.isnull()

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12     True
13     True
Name: B, dtype: bool

But applying `.isnull()` to the ENTIRE DataFrame returns a DataFrame of **booleans**. The `.isna()` method is an alias for `.isnull()` and gives the same result.

In [50]:
df.isnull()

Unnamed: 0,A,B,C,D,E,F,G,H
0,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False
5,False,False,False,False,False,False,False,False
6,False,False,False,False,False,False,True,False
7,False,False,False,False,False,False,True,False
8,False,False,False,False,False,False,True,False
9,False,False,False,False,False,False,False,False


In [51]:
df.isna()

Unnamed: 0,A,B,C,D,E,F,G,H
0,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False
5,False,False,False,False,False,False,False,False
6,False,False,False,False,False,False,True,False
7,False,False,False,False,False,False,True,False
8,False,False,False,False,False,False,True,False
9,False,False,False,False,False,False,False,False


Since the result is a DataFrame, we can now apply any appropriate summary method to the **boolean** DataFrame.

For example, summing counts the missing values in each column, because `True` counts as 1 and `False` counts as 0.

In [52]:
df.isna().sum()

A    2
B    2
C    2
D    2
E    2
F    0
G    5
H    0
dtype: int64

If you want the proportion of missing values, apply the `.mean()` method instead of the `.sum()` method.

In [53]:
df.isna().mean()

A    0.142857
B    0.142857
C    0.142857
D    0.142857
E    0.142857
F    0.000000
G    0.357143
H    0.000000
dtype: float64
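The count and the proportion can also be collected side by side in one small table. This is a minimal sketch on a hypothetical toy DataFrame. The names `toy`, `missing_summary`, `n_missing`, and `prop_missing` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy DataFrame (not the lecture data).
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, np.nan],
    'b': ['u', 'v', None, 'x'],
    'c': [1, 2, 3, 4],
})

is_missing = toy.isna()  # boolean DataFrame: True where an entry is missing

# One row per original column, with the count and proportion of missing values.
missing_summary = pd.DataFrame({
    'n_missing': is_missing.sum(),
    'prop_missing': is_missing.mean(),
})

print(missing_summary)
```

Each row of `missing_summary` describes one column of `toy`, which makes it easy to sort or filter columns by how much is missing.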

The above is the summary per column ... but we can also summarize per row!

In [54]:
df

Unnamed: 0,A,B,C,D,E,F,G,H
0,a,0.0,-100.0,Jan,aa,10,100.0,AAA
1,b,1.0,-200.0,Feb,aa,20,100.0,BBB
2,c,2.0,-300.0,Mar,aa,10,100.0,AAA
3,d,3.0,-400.0,Apr,bb,20,200.0,BBB
4,e,4.0,-500.0,May,bb,10,200.0,AAA
5,f,5.0,-600.0,Jun,bb,20,200.0,BBB
6,g,6.0,-700.0,Jul,cc,10,,AAA
7,h,7.0,-800.0,Aug,cc,20,,BBB
8,i,8.0,-900.0,Sep,cc,10,,AAA
9,j,9.0,-1000.0,Oct,dd,20,400.0,BBB


In [55]:
df.isna()

Unnamed: 0,A,B,C,D,E,F,G,H
0,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False
5,False,False,False,False,False,False,False,False
6,False,False,False,False,False,False,True,False
7,False,False,False,False,False,False,True,False
8,False,False,False,False,False,False,True,False
9,False,False,False,False,False,False,False,False


In [56]:
df.isna().sum(axis=1)

0     0
1     0
2     0
3     0
4     0
5     0
6     1
7     1
8     1
9     0
10    0
11    0
12    6
13    6
dtype: int64

The result shown above is the number of missing values in each row.

This gives a simple way to identify rows with 0 missing values, a.k.a. complete rows.

In [57]:
df.loc[df.isna().sum(axis=1) == 0, :]

Unnamed: 0,A,B,C,D,E,F,G,H
0,a,0.0,-100.0,Jan,aa,10,100.0,AAA
1,b,1.0,-200.0,Feb,aa,20,100.0,BBB
2,c,2.0,-300.0,Mar,aa,10,100.0,AAA
3,d,3.0,-400.0,Apr,bb,20,200.0,BBB
4,e,4.0,-500.0,May,bb,10,200.0,AAA
5,f,5.0,-600.0,Jun,bb,20,200.0,BBB
9,j,9.0,-1000.0,Oct,dd,20,400.0,BBB
10,k,10.0,-1100.0,Nov,dd,10,400.0,AAA
11,l,11.0,-1200.0,Dec,dd,20,400.0,BBB


This idea of removing every row that has at least 1 missing value is known as creating the **complete cases**.

This is what happens behind the scenes in a lot of modeling functions!

In [59]:
df.shape

(14, 8)

In [60]:
df.loc[ df.isna().sum(axis=1) ==0, :].shape

(9, 8)

There is a streamlined method for creating the complete cases: `.dropna()`!

In [61]:
df.dropna()

Unnamed: 0,A,B,C,D,E,F,G,H
0,a,0.0,-100.0,Jan,aa,10,100.0,AAA
1,b,1.0,-200.0,Feb,aa,20,100.0,BBB
2,c,2.0,-300.0,Mar,aa,10,100.0,AAA
3,d,3.0,-400.0,Apr,bb,20,200.0,BBB
4,e,4.0,-500.0,May,bb,10,200.0,AAA
5,f,5.0,-600.0,Jun,bb,20,200.0,BBB
9,j,9.0,-1000.0,Oct,dd,20,400.0,BBB
10,k,10.0,-1100.0,Nov,dd,10,400.0,AAA
11,l,11.0,-1200.0,Dec,dd,20,400.0,BBB
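To check that the boolean filter and `.dropna()` really produce the same complete cases, here is a minimal sketch on a hypothetical toy DataFrame (`toy` and its columns are made up). It also shows the `subset` argument of `.dropna()`, which is not used in the lecture: it drops rows based on missing values in selected columns only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy DataFrame (not the lecture data).
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0],
    'b': ['u', 'v', None],
})

# Complete cases via the boolean filter: rows with 0 missing values.
complete_mask = toy.isna().sum(axis=1) == 0
by_mask = toy.loc[complete_mask, :]

# Complete cases via the streamlined method.
by_dropna = toy.dropna()

print(by_mask.equals(by_dropna))  # True

# Only look for missing values in column 'a' when deciding which rows to drop.
by_subset = toy.dropna(subset=['a'])
print(by_subset.index.tolist())   # [0, 2]
```

Neither approach modifies `toy` itself. Both return a new DataFrame that keeps the original row labels.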
