# Question 2 in Problem Set 6 

*Stats 507, Fall 2021*

Shihao Wu, PhD student in statistics

wshihao@umich.edu

## Imports

The remaining questions will use the following imports.

In [3]:
import pandas as pd
import numpy as np

## Question 0 - Topics in Pandas 

For this question, please pick a topic, such as a function, class, method, recipe or idiom related to the pandas Python library, and create a short tutorial or overview of that topic. The only rules are below.

1. Pick a topic *not* covered in the class slides.
2. Do not knowingly pick the same topic as someone else.
3. Use bullet points and titles (level 2 headers) to create the equivalent of **3-5** “slides” of key points. They shouldn’t actually be slides, but please structure your key points in a manner similar to the class slides (viewed as a notebook).
4. Include executable example code in code cells to illustrate your topic.

You do not need to clear your topic with me. If you want feedback on your topic choice, please discuss with me or a GSI in office hours.

## Topic: Missing Data in Pandas

Shihao Wu, PhD student in statistics

Reference: [https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html)

There are 4 "slides" for this topic.


## Missing data
Missing data arises in various circumstances in statistical analysis. Consider the following example:

In [4]:
# generate a data frame with float, string and bool values
df = pd.DataFrame(
    np.random.randn(5, 3),
    index=["a", "c", "e", "f", "h"],
    columns=["1", "2", "3"],
)
df['4'] = "bar"
df['5'] = df["1"] > 0

# reindex so that there will be missing values in the data frame
df2 = df.reindex(["a", "b", "c", "d", "e", "f", "g", "h"])

df2

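|   | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| a | -1.163547 | 0.671639 | -0.670563 | bar | False |
| b | NaN | NaN | NaN | NaN | NaN |
| c | 0.585098 | -1.573175 | -0.819964 | bar | True |
| d | NaN | NaN | NaN | NaN | NaN |
| e | 0.850473 | 0.427974 | -1.146248 | bar | True |
| f | 0.401774 | -1.416938 | 0.213961 | bar | True |
| g | NaN | NaN | NaN | NaN | NaN |
| h | 0.990349 | -0.945876 | -0.458248 | bar | True |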


The missing values arise because <code>reindex()</code> adds the labels b, d and g, which have no rows in the original data frame.

## Detecting missing data

To make detecting missing values easier (and across different array dtypes), pandas provides the <code>isna()</code> and <code>notna()</code> functions, which are also methods on Series and DataFrame objects:

In [5]:
df2["1"]

a   -1.163547
b         NaN
c    0.585098
d         NaN
e    0.850473
f    0.401774
g         NaN
h    0.990349
Name: 1, dtype: float64

In [6]:
pd.isna(df2["1"])

a    False
b     True
c    False
d     True
e    False
f    False
g     True
h    False
Name: 1, dtype: bool

In [7]:
df2["4"].notna()

a     True
b    False
c     True
d    False
e     True
f     True
g    False
h     True
Name: 4, dtype: bool

In [8]:
df2.isna()

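|   | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| a | False | False | False | False | False |
| b | True | True | True | True | True |
| c | False | False | False | False | False |
| d | True | True | True | True | True |
| e | False | False | False | False | False |
| f | False | False | False | False | False |
| g | True | True | True | True | True |
| h | False | False | False | False | False |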
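
Because <code>isna()</code> returns booleans, its result can be summed or used as a filter. The following is a minimal sketch on a made-up frame (the name <code>demo</code> and its values are illustrative, not part of the example above). It also shows why an equality test cannot replace <code>isna()</code>: <code>NaN</code> never compares equal to itself.

```python
import numpy as np
import pandas as pd

# A small frame with known gaps (values are made up for illustration).
demo = pd.DataFrame(
    {"x": [1.0, np.nan, 3.0], "y": ["a", None, None]},
    index=["r1", "r2", "r3"],
)

# Booleans sum as 0/1, so this counts the missing entries in each column.
missing_per_column = demo.isna().sum()

# Keep only the rows that contain at least one missing value.
rows_with_gaps = demo[demo.isna().any(axis=1)]

# NaN is not equal to itself, so == cannot be used to detect it.
nan_equals_nan = np.nan == np.nan

print(missing_per_column)
print(rows_with_gaps)
print(nan_equals_nan)
```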


## Inserting missing data

You can insert missing values by simply assigning to containers. The actual missing value used will be chosen based on the dtype.

For example, numeric containers will always use <code>NaN</code> regardless of the missing value type chosen:

In [9]:
s = pd.Series([1, 2, 3])
s.loc[0] = None
s

0    NaN
1    2.0
2    3.0
dtype: float64

Because <code>NaN</code> is a float, a column of integers with even one missing value is cast to a floating-point dtype. pandas provides a nullable integer array, which can be used by explicitly requesting the dtype:

In [10]:
pd.Series([1, 2, np.nan, 4], dtype=pd.Int64Dtype())

0       1
1       2
2    <NA>
3       4
dtype: Int64
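
To make the cast visible, the sketch below compares dtypes before and after a gap appears. The variable names are illustrative, and the gap is introduced with <code>reindex()</code> purely for convenience.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])
dtype_before = ints.dtype

# Reindexing adds label 3 with no value, which forces a cast to float.
with_gap = ints.reindex([0, 1, 2, 3])
dtype_after = with_gap.dtype

# The nullable extension dtype keeps the integers and marks the gap as <NA>.
nullable = pd.Series([1, 2, None], dtype="Int64")

print(dtype_before, dtype_after, nullable.dtype)
print(nullable)
```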

Likewise, datetime containers will always use <code>NaT</code>.
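
A short sketch of the datetime case, using made-up dates: assigning <code>None</code> stores <code>NaT</code> and leaves the dtype as a datetime type.

```python
import pandas as pd

# Three illustrative dates; the values themselves are arbitrary.
dates = pd.Series(pd.to_datetime(["2021-10-01", "2021-10-02", "2021-10-03"]))

# Assigning None stores NaT rather than None or NaN.
dates.loc[1] = None

print(dates)
print(dates.isna())
```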

For object containers, pandas will use the value given:

In [11]:
s = pd.Series(["a", "b", "c"])
s.loc[0] = None
s.loc[1] = np.nan
s

0    None
1     NaN
2       c
dtype: object
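
Although an object container stores <code>None</code> and <code>NaN</code> as given, <code>isna()</code> and <code>notna()</code> treat both as missing. A minimal sketch with an illustrative Series:

```python
import numpy as np
import pandas as pd

# dtype=object keeps None and NaN exactly as they were passed in.
labels = pd.Series(["a", None, np.nan, "d"], dtype=object)

# Both kinds of missing marker are flagged.
mask = labels.isna()

# Keep only the entries that are present.
present = labels[labels.notna()]

print(mask)
print(present)
```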

## Calculations with missing data 

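Missing values propagate through arithmetic operations between pandas objects. The operands are first aligned on their row and column labels, so a label that appears in only one operand also produces <code>NaN</code>, as columns 1 and 3 do here: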

In [12]:
a = df2[['1','2']]
b = df2[['2','3']]
a + b

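|   | 1 | 2 | 3 |
|---|---|---|---|
| a | NaN | 1.343277 | NaN |
| b | NaN | NaN | NaN |
| c | NaN | -3.14635 | NaN |
| d | NaN | NaN | NaN |
| e | NaN | 0.855947 | NaN |
| f | NaN | -2.833876 | NaN |
| g | NaN | NaN | NaN |
| h | NaN | -1.891752 | NaN |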
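
When a one-sided gap should count as zero instead of propagating, the <code>add()</code> method accepts a <code>fill_value</code>. The sketch below uses two small made-up frames (<code>left</code> and <code>right</code> are illustrative names). A cell stays missing only where both operands are missing.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"x": [1.0, np.nan], "y": [10.0, 20.0]})
right = pd.DataFrame({"y": [1.0, 2.0], "z": [100.0, np.nan]})

# Plain addition: columns x and z exist on one side only, so they become NaN.
plain = left + right

# fill_value=0 substitutes 0 for a value missing on exactly one side.
filled = left.add(right, fill_value=0)

print(plain)
print(filled)
```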


pandas handles missing values sensibly in descriptive statistics and other computations. For example:

* When summing data, NA (missing) values will be treated as zero.
* If the data are all <code>NA</code>, the result will be 0.
* Cumulative methods like <code>cumsum()</code> and <code>cumprod()</code> ignore <code>NA</code> values by default, but preserve them in the resulting arrays. To override this behaviour and include <code>NA</code> values, use <code>skipna=False</code>.
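
These rules can be checked on a small made-up Series (the names below are illustrative). The sketch also uses the <code>min_count</code> argument of <code>sum()</code>, which requires a minimum number of non-missing values and returns <code>NaN</code> otherwise.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, np.nan, 3.0])
all_missing = pd.Series([np.nan, np.nan])

total = values.sum()                             # NaN is skipped
total_strict = values.sum(skipna=False)          # NaN propagates
empty_total = all_missing.sum()                  # all missing sums to 0
empty_total_min = all_missing.sum(min_count=1)   # NaN: no valid values

# cumsum skips the NaN when accumulating but keeps it in the output.
running = values.cumsum()

print(total, total_strict, empty_total, empty_total_min)
print(running)
```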

In [13]:
df2

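|   | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| a | -1.163547 | 0.671639 | -0.670563 | bar | False |
| b | NaN | NaN | NaN | NaN | NaN |
| c | 0.585098 | -1.573175 | -0.819964 | bar | True |
| d | NaN | NaN | NaN | NaN | NaN |
| e | 0.850473 | 0.427974 | -1.146248 | bar | True |
| f | 0.401774 | -1.416938 | 0.213961 | bar | True |
| g | NaN | NaN | NaN | NaN | NaN |
| h | 0.990349 | -0.945876 | -0.458248 | bar | True |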


In [14]:
df2["1"].sum()

1.664146560537496

In [15]:
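# restrict to the float columns; newer pandas no longer drops non-numeric columns silently
df2[["1", "2", "3"]].mean(axis=1)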

a   -0.387491
b         NaN
c   -0.602680
d         NaN
e    0.044066
f   -0.267068
g         NaN
h   -0.137925
dtype: float64

In [16]:
df2[['1','2','3']].cumsum()

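|   | 1 | 2 | 3 |
|---|---|---|---|
| a | -1.163547 | 0.671639 | -0.670563 |
| b | NaN | NaN | NaN |
| c | -0.578449 | -0.901536 | -1.490527 |
| d | NaN | NaN | NaN |
| e | 0.272024 | -0.473563 | -2.636775 |
| f | 0.673798 | -1.890501 | -2.422814 |
| g | NaN | NaN | NaN |
| h | 1.664147 | -2.836377 | -2.881062 |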


In [17]:
df2[['1','2','3']].cumsum(skipna=False)

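|   | 1 | 2 | 3 |
|---|---|---|---|
| a | -1.163547 | 0.671639 | -0.670563 |
| b | NaN | NaN | NaN |
| c | NaN | NaN | NaN |
| d | NaN | NaN | NaN |
| e | NaN | NaN | NaN |
| f | NaN | NaN | NaN |
| g | NaN | NaN | NaN |
| h | NaN | NaN | NaN |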


Missing data is ubiquitous, and dealing with missing values is unavoidable in data analysis. This concludes my topic here.