
In [None]:
import numpy as np
import pandas as pd

### Axis 0 and 1

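Axis 0 runs down the rows and axis 1 runs across the columns. Summing along axis 0 adds the values in each column (collapsing the rows); summing along axis 1 adds the values in each row (collapsing the columns).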

In [None]:
a = np.array([[1, 2], [3, 4]])
print(a)
print("np.sum ", np.sum(a))
print("np.sum (axis 0)",  np.sum(a, axis=0))
print("np.sum (axis 1)", np.sum(a, axis=1))


[[1 2]
 [3 4]]
np.sum  10
np.sum (axis 0) [4 6]
np.sum (axis 1) [3 7]


In [None]:
df = pd.DataFrame({'a': [1, 2, 3, 4],
                   'b': np.arange(5,9),
                   'c': np.arange(100, 104)})
df

Unnamed: 0,a,b,c
0,1,5,100
1,2,6,101
2,3,7,102
3,4,8,103


Most pandas reductions default to `axis=0`, which produces one result per column. Pass `axis=1` explicitly to get one result per row.

In [None]:
df.sum()

a     10
b     26
c    406
dtype: int64

In [None]:
df.sum(axis=1)

0    106
1    109
2    112
3    115
dtype: int64
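
pandas also accepts the string aliases `"index"` and `"columns"` for `axis`, which some readers find easier to follow than 0 and 1. A minimal sketch, rebuilding the small frame from above under the assumed name `demo`:

```python
import numpy as np
import pandas as pd

# `demo` is an illustrative copy of the frame used above
demo = pd.DataFrame({"a": [1, 2, 3, 4],
                     "b": np.arange(5, 9),
                     "c": np.arange(100, 104)})

col_totals = demo.sum(axis="index")    # same as axis=0: one total per column
row_totals = demo.sum(axis="columns")  # same as axis=1: one total per row

print(col_totals)
print(row_totals)
```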

How about summing columns of non-numeric values?

In [None]:
data = {'students': ['Alice','Bob','Charlie','Dave','Eva', 'Frank'],
      'subjects': ['Bio','Physics','Math','Arts','Chemistry', 'Economics'],
      'score1': [55, 40, 63, 90, 45, 45]}

df = pd.DataFrame(data)
df


Unnamed: 0,students,subjects,score1
0,Alice,Bio,55
1,Bob,Physics,40
2,Charlie,Math,63
3,Dave,Arts,90
4,Eva,Chemistry,45
5,Frank,Economics,45


In [None]:
df.sum()

students             AliceBobCharlieDaveEvaFrank
subjects    BioPhysicsMathArtsChemistryEconomics
score1                                       338
dtype: object

In [None]:
df.score1.sum()

338
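
Summing a string column concatenates its values, which is rarely what we want. One possible way to restrict the sum to numeric columns is the `numeric_only` parameter. The sketch below uses a shortened copy of the data under the assumed name `grades`:

```python
import pandas as pd

# `grades` is an illustrative, shortened copy of the frame above
grades = pd.DataFrame({"students": ["Alice", "Bob", "Charlie"],
                       "subjects": ["Bio", "Physics", "Math"],
                       "score1": [55, 40, 63]})

# Only numeric columns take part in the sum; string columns are skipped.
numeric_totals = grades.sum(numeric_only=True)
print(numeric_totals)
```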

## Replace

We can replace values in the dataframe.  

In [None]:
df["score2"] = [45, 55, 40, 90, 20, 25]

In [None]:
df.replace(45, 25)

Unnamed: 0,students,subjects,score1,score2
0,Alice,Bio,55,25
1,Bob,Physics,40,55
2,Charlie,Math,63,40
3,Dave,Arts,90,90
4,Eva,Chemistry,25,20
5,Frank,Economics,25,25


What if we want to replace only in one column?

In [None]:
df.score1.replace(90, 30)

0    55
1    40
2    63
3    30
4    45
5    45
Name: score1, dtype: int64

In [None]:
df

Unnamed: 0,students,subjects,score1,score2
0,Alice,Bio,55,45
1,Bob,Physics,40,55
2,Charlie,Math,63,40
3,Dave,Arts,90,90
4,Eva,Chemistry,45,20
5,Frank,Economics,45,25


## Inplace modification of DataFrames

Notice that most of the modification operations we have seen so far return a new Series or DataFrame containing the modification and leave the original untouched. We have two options when we want to modify the original dataset.

The first is the following idiom:
```
df = df.replace(...)
```

The second option is to use `inplace=True` as an additional parameter in the function.
```
df.replace("old value", "new value", inplace=True)
```
This will modify the df object with new values in place of the old values.
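
The difference between the two options can be checked on a small frame. This is only a sketch; the frame `demo` and its values are made up for illustration.

```python
import pandas as pd

demo = pd.DataFrame({"score1": [55, 90], "score2": [45, 90]})

# Without assignment or inplace=True, the original is left untouched.
replaced = demo.replace(90, 30)
value_before = demo.loc[1, "score1"]   # still 90 in the original

# Option 1: rebind the name to the returned copy.
demo = demo.replace(90, 30)

# Option 2: modify the existing object in place.
demo.replace(45, 25, inplace=True)

print(replaced)
print(demo)
```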


In [None]:
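# Calling replace(..., inplace=True) on df.score1 is a chained operation and may
# not update df in recent pandas versions, so target the column through df.
df.replace({"score1": {90: 30}}, inplace=True)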
df

Unnamed: 0,students,subjects,score1,score2
0,Alice,Bio,55,45
1,Bob,Physics,40,55
2,Charlie,Math,63,40
3,Dave,Arts,30,90
4,Eva,Chemistry,45,20
5,Frank,Economics,45,25


## Missing values

Missing values occur often in data and are especially common in surveys. A missing value could mean one of many things:
  - The question is not relevant.
  - The question was asked but the respondent refused to answer.
  - The respondent does not know the answer to the question.
  - The respondent answered but the answer was not recorded (for privacy reasons or because of an entry error).

For example, when the survey question is 'What is the age of your first child?' or 'What is the weight of your second child?', the question might not be relevant for people without children. Others might refuse to answer. If only one person, or very few people, have a child of a particular age (say, a newborn), recording that age could reveal the person's identity, so the data might be removed for privacy reasons.

In many cases, special codes, called *sentinels*, are entered for each of these categories. In the BRFSS dataset, `9999` is used for `Refused`, `7777` for `Don't know`, and `BLANK` for `Not asked or missing`.
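
Before analysis, sentinel codes are usually converted to `NaN` so that pandas treats them as missing. A minimal sketch, applying the two numeric codes mentioned above to a made-up column named `child_age` (the column name and the values are illustrative, not taken from BRFSS):

```python
import numpy as np
import pandas as pd

# Hypothetical survey answers; 7777 and 9999 are sentinel codes, not ages.
survey = pd.DataFrame({"child_age": [4, 9999, 12, 7777, 7]})

# Turn both sentinel codes into NaN in one call.
cleaned = survey.replace([7777, 9999], np.nan)

n_missing = cleaned["child_age"].isna().sum()
mean_age = cleaned["child_age"].mean()   # NaN entries are skipped by default
print(n_missing, mean_age)
```

Leaving the sentinels in place would pull the mean far above any real age, which is why the conversion comes first.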

How do we handle missing data?

Since data can be missing for many reasons, we need different strategies for dealing with it:
1. Ignore the missing data.
2. Drop the rows or columns that contain missing data.
3. Infer the value from other columns of the same row.
4. Estimate the value from nearby rows of the same column.

The last two options are called *imputing* values for missing data.
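
As a sketch of option 3, a missing value can sometimes be computed from other columns in the same row. The frame `exam` and its column names (`part1`, `part2`, `total`) are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical exam data where `total` should equal part1 + part2.
exam = pd.DataFrame({"part1": [10, 20, 30],
                     "part2": [5, 15, 25],
                     "total": [15, np.nan, 55]})

# fillna accepts a Series; values are matched by index, so each missing
# total is filled from the two parts of the same row.
exam["total"] = exam["total"].fillna(exam["part1"] + exam["part2"])
print(exam)
```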

In [None]:
scores = pd.DataFrame({"Alice": [10000, 20000, 30000, np.nan],
                       "Bob": [15000, None, 25000, 30000],
                       "Charlie": [5000, 6000, None, 8200],
                       "Dave": [-1, 2000, None, None],
                       "Eva": [1, 4, None, 16]}
                      )
scores['rounds'] = pd.Series([2010, 2011, 2012, 2013])
scores.set_index('rounds', inplace=True)
scores

rounds,Alice,Bob,Charlie,Dave,Eva
2010,10000.0,15000.0,5000.0,-1.0,1.0
2011,20000.0,,6000.0,2000.0,4.0
2012,30000.0,25000.0,,,
2013,,30000.0,8200.0,,16.0


### Drop missing values.

Use `axis=0` or `axis=1` to
- drop rows with missing entries (`axis=0`, the default)
- drop columns with missing entries (`axis=1`)

In [None]:
# drop any row that has missing entries
scores.dropna()

rounds,Alice,Bob,Charlie,Dave,Eva
2010,10000.0,15000.0,5000.0,-1.0,1.0


In [None]:
# drop the missing entries from a single column (a Series); the column itself is kept
scores.Dave.dropna()

rounds
2010      -1.0
2011    2000.0
Name: Dave, dtype: float64
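
`dropna` has a few more switches worth knowing. The sketch below uses an assumed frame `demo` to show `axis=1`, `how="all"`, and `thresh`.

```python
import numpy as np
import pandas as pd

# Illustrative frame: every column has a gap, and the last row is all NaN.
demo = pd.DataFrame({"x": [1.0, 2.0, np.nan],
                     "y": [4.0, np.nan, np.nan],
                     "z": [7.0, 8.0, np.nan]})

no_missing_cols = demo.dropna(axis=1)      # drop columns containing any NaN
not_all_missing = demo.dropna(how="all")   # drop rows where every value is NaN
at_least_three = demo.dropna(thresh=3)     # keep rows with at least 3 non-NaN values

print(no_missing_cols.shape)
print(not_all_missing)
print(at_least_three)
```

Because every column here has at least one gap, `dropna(axis=1)` removes all of them, which is a reminder to check how much data a drop discards.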

### Replacing sentinel values with a default value.

Examples: 
- replace -1 with a 0.
- replace `NaN` with 100.

In [None]:
scores.replace(-1, 0, inplace=True)

In [None]:
scores.fillna(100)  # inplace=False, so scores itself is unchanged

rounds,Alice,Bob,Charlie,Dave,Eva
2010,10000.0,15000.0,5000.0,0.0,1.0
2011,20000.0,100.0,6000.0,2000.0,4.0
2012,30000.0,25000.0,100.0,100.0,100.0
2013,100.0,30000.0,8200.0,100.0,16.0


### Fill values based on nearby entries.

- backfill
- forward fill
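
A small Series makes the direction of each fill easy to see. The sketch also shows the `limit` parameter, which caps how many consecutive gaps are filled. The Series `s` is made up for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

forward = s.ffill()         # carry the previous valid value down
backward = s.bfill()        # pull the next valid value up
capped = s.ffill(limit=1)   # fill at most one consecutive NaN

print(pd.DataFrame({"original": s, "ffill": forward,
                    "bfill": backward, "ffill(limit=1)": capped}))
```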

In [None]:
scores

rounds,Alice,Bob,Charlie,Dave,Eva
2010,10000.0,15000.0,5000.0,0.0,1.0
2011,20000.0,,6000.0,2000.0,4.0
2012,30000.0,25000.0,,,
2013,,30000.0,8200.0,,16.0


In [None]:
scores.bfill()  # backfill: copy the next valid value upward (fillna(method="bfill") is deprecated)

rounds,Alice,Bob,Charlie,Dave,Eva
2010,10000.0,15000.0,5000.0,0.0,1.0
2011,20000.0,25000.0,6000.0,2000.0,4.0
2012,30000.0,25000.0,8200.0,,16.0
2013,,30000.0,8200.0,,16.0


In [None]:
scores.ffill()  # forward fill ("pad"): carry the last valid value downward

rounds,Alice,Bob,Charlie,Dave,Eva
2010,10000.0,15000.0,5000.0,0.0,1.0
2011,20000.0,15000.0,6000.0,2000.0,4.0
2012,30000.0,25000.0,6000.0,2000.0,4.0
2013,30000.0,30000.0,8200.0,2000.0,16.0


In [None]:
scores

rounds,Alice,Bob,Charlie,Dave,Eva
2010,10000.0,15000.0,5000.0,0.0,1.0
2011,20000.0,,6000.0,2000.0,4.0
2012,30000.0,25000.0,,,
2013,,30000.0,8200.0,,16.0


### Interpolate the values

If the values rise or fall in a regular pattern, we can *interpolate* the missing entries from their neighbors, for example as the average of the values on either side. See the `DataFrame.interpolate` documentation for the available methods: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.interpolate.html#pandas.DataFrame.interpolate

In [None]:
scores.interpolate(method="linear") # inplace=False

rounds,Alice,Bob,Charlie,Dave,Eva
2010,10000.0,15000.0,5000.0,0.0,1.0
2011,20000.0,20000.0,6000.0,2000.0,4.0
2012,30000.0,25000.0,7100.0,2000.0,10.0
2013,30000.0,30000.0,8200.0,2000.0,16.0


In [None]:
scores

rounds,Alice,Bob,Charlie,Dave,Eva
2010,10000.0,15000.0,5000.0,0.0,1.0
2011,20000.0,,6000.0,2000.0,4.0
2012,30000.0,25000.0,,,
2013,,30000.0,8200.0,,16.0


In [None]:
scores["Eva"].interpolate(method="quadratic")  # requires SciPy; uses the index values as x

rounds
2010     1.0
2011     4.0
2012     9.0
2013    16.0
Name: Eva, dtype: float64
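
The interpolated value 9.0 can be checked by hand: the three known points (2010, 1), (2011, 4), and (2013, 16) all lie on the parabola `(x - 2009)**2`, which gives 9 at 2012. A quick check with `np.polyfit` follows; shifting the years to start at 0 is a convenience of this sketch to keep the fit well conditioned, not something pandas requires.

```python
import numpy as np

# Known points for Eva, with years shifted so that 2010 becomes 0.
years = np.array([2010, 2011, 2013]) - 2010   # 0, 1, 3
values = np.array([1.0, 4.0, 16.0])

# A degree-2 polynomial passes exactly through three points.
coeffs = np.polyfit(years, values, deg=2)
estimate = np.polyval(coeffs, 2012 - 2010)
print(round(estimate, 6))
```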

In [None]:
# interpolate() returned a new object each time, so scores itself is unchanged
scores

rounds,Alice,Bob,Charlie,Dave,Eva
2010,10000.0,15000.0,5000.0,0.0,1.0
2011,20000.0,,6000.0,2000.0,4.0
2012,30000.0,25000.0,,,
2013,,30000.0,8200.0,,16.0


In [None]:
scores.ffill(axis=1)  # fill across columns: each gap takes the value to its left in the same row

rounds,Alice,Bob,Charlie,Dave,Eva
2010,10000.0,15000.0,5000.0,0.0,1.0
2011,20000.0,20000.0,6000.0,2000.0,4.0
2012,30000.0,25000.0,25000.0,25000.0,25000.0
2013,,30000.0,8200.0,8200.0,16.0


### Which of these methods should one use?

We have shown only some methods for imputation. More sophisticated methods exist. For example, if latitude and longitude are missing for some entries but the address is given, then
1. We can perhaps query a maps database to get the latitude and longitude.
2. We can use other entries in the dataframe with the same value in the address column to get the latitude and longitude.
3. If a nearby address is provided, we can approximate the latitude and longitude with those of the nearby entry.

Which method to use and how many values to change depend on the dataset and the fields being changed.

Each method also affects other statistics that might be computed later.


In [None]:
scores

rounds,Alice,Bob,Charlie,Dave,Eva
2010,10000.0,15000.0,5000.0,0.0,1.0
2011,20000.0,,6000.0,2000.0,4.0
2012,30000.0,25000.0,,,
2013,,30000.0,8200.0,,16.0


In [None]:
senior_age = pd.Series([75, 65, np.nan, 60])
print(senior_age)

s=senior_age.ffill()
s.mean()

0    75.0
1    65.0
2     NaN
3    60.0
dtype: float64


66.25

In [None]:
# Treat a missing entry as "did not participate": mean() skips NaN by default.
eva_scores = scores["Eva"]
eva_scores.mean()

7.0

In [None]:
# With skipna=False the mean is NaN if there is even one `NaN` entry.
eva_scores.mean(skipna=False)

nan

In [None]:
# if we take a missing value to mean a "failure" to score then average drops.
eva_scores.fillna(0).mean()

5.25

In [None]:
eva_scores.bfill().mean()

9.25

In [None]:
eva_scores.ffill().mean()

6.25

The statistics for a column change depending on
- the method of filling used
- the number of missing values
- the order of the rows
- the distribution of the data.
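
The effect can be summarized in one place. The sketch below rebuilds Eva's scores under the assumed name `eva` and computes the mean under each strategy; linear interpolation is added for comparison.

```python
import numpy as np
import pandas as pd

# Eva's scores from the table above, rebuilt so the example is self-contained.
eva = pd.Series([1, 4, np.nan, 16], index=[2010, 2011, 2012, 2013])

means = {
    "skip NaN": eva.mean(),
    "fill with 0": eva.fillna(0).mean(),
    "forward fill": eva.ffill().mean(),
    "backfill": eva.bfill().mean(),
    "linear interpolation": eva.interpolate(method="linear").mean(),
}

for name, value in means.items():
    print(f"{name}: {value}")
```

A single missing entry out of four moves the mean anywhere between 5.25 and 9.25, depending on the strategy.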

### Example Scenario 1

We have the ages of individuals in a senior center. Some values are missing.

1. What will happen if we fill the missing values with zeros?
2. What if we knew that the missing values were for the spouses of the others?
3. What if we fill the missing values with nearby values?


### Example Scenario 2

Let us say we have GDP data for countries from 1960 to 2000. The data for 1997 is missing.

1. What if we fill it with the mean of the GDP of all the years?
2. What if we use backfill or forward fill?
3. Is there a better method?


### Example Scenario 3

Let us say we have sales data for individuals in a company.  One column has missing data for the last few years.  How should we fill this column?

### What if all the values in a slice are NaN?

If every value is `NaN`, there is nothing left to average, so the mean itself is `NaN`. In that case `np.nanmean()` also gives us a warning.

In [None]:
s1 = pd.Series([np.nan, np.nan, np.nan])
s2 = pd.Series([0, np.nan, np.nan])

print(np.mean(s1))
print(np.mean(s2))

nan
0.0


In [None]:
# nanmean() will return nan but also give a warning
print(np.nanmean(s1))

nan


  


In [None]:
print(np.nanmean(s2))

0.0
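
If an all-`NaN` slice is an expected case, one option is to test for it before calling `np.nanmean()`, so that no warning is raised. `safe_nanmean` below is a hypothetical helper written for this sketch, not a NumPy function.

```python
import numpy as np

def safe_nanmean(values):
    """Return the mean ignoring NaN, or NaN when there is nothing to average."""
    arr = np.asarray(values, dtype=float)
    # Handle the empty and all-NaN cases ourselves to avoid the RuntimeWarning.
    if arr.size == 0 or np.isnan(arr).all():
        return np.nan
    return np.nanmean(arr)

all_missing = safe_nanmean([np.nan, np.nan, np.nan])
one_value = safe_nanmean([0, np.nan, np.nan])
print(all_missing, one_value)
```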
