<a href="https://colab.research.google.com/github/Data-Intelligence-Mastery/data_science_interview_questions/blob/master/Q014_replace_bad_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Replacing bad data with Pandas

`Data Analysis Python Pandas Data Cleaning`

Suppose you are given a dataframe, `df`, that contains some negative values. In the context of your work, negative values are considered 'bad data'. Write a function in Python (using Pandas) that replaces each bad value with the group mean, taken here to be the mean of the valid (non-negative) values in the same column.

In [64]:
import pandas as pd
from tabulate import tabulate 

df_ = pd.DataFrame([[1, 2, 3],[1, -1, 1], [3, 3, -2], [2, 4, -3]])
print(tabulate(df_, headers='keys', tablefmt='psql'))

+----+-----+-----+-----+
|    |   0 |   1 |   2 |
|----+-----+-----+-----|
|  0 |   1 |   2 |   3 |
|  1 |   1 |  -1 |   1 |
|  2 |   3 |   3 |  -2 |
|  3 |   2 |   4 |  -3 |
+----+-----+-----+-----+


We want to replace the 'bad data' with the mean of the remaining values in the same column. In column 1, -1 becomes the average of 2, 3, and 4, which is 3. In column 2, -2 and -3 become the average of 3 and 1, which is 2.
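The target values can be checked directly before writing any replacement function. The snippet below is a minimal sketch, not part of the original notebook; the name `valid_means` is introduced here only for illustration.

```python
import pandas as pd

df_ = pd.DataFrame([[1, 2, 3], [1, -1, 1], [3, 3, -2], [2, 4, -3]])

# Keep non-negative entries and turn negatives into NaN;
# mean() skips NaN, so each column mean covers only the valid values.
valid_means = df_.where(df_ >= 0).mean()
print(valid_means.tolist())  # [1.75, 3.0, 2.0]
```

Column 0 has no negative values, so its mean (1.75) is never used as a replacement.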


### Solution 1, coding from scratch

In [66]:
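def replace_bad_data_1(df_):
  df = df_.copy()
  for col in df.columns:
    bad = df[col] < 0  # negative values are the bad data
    if bad.any():
      # cast to float so a non-integer mean can be stored in the column
      df[col] = df[col].astype(float)
      # mean of the valid (non-negative) values in this column
      col_mean = df.loc[~bad, col].mean()
      # .loc avoids chained indexing, which may assign to a temporary copy
      df.loc[bad, col] = col_mean
  return df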

%timeit replace_bad_data_1(df_)
df = replace_bad_data_1(df_)
print(tabulate(df, headers='keys', tablefmt='psql'))

100 loops, best of 3: 4.98 ms per loop
+----+-----+-----+-----+
|    |   0 |   1 |   2 |
|----+-----+-----+-----|
|  0 |   1 |   2 |   3 |
|  1 |   1 |   3 |   1 |
|  2 |   3 |   3 |   2 |
|  3 |   2 |   4 |   2 |
+----+-----+-----+-----+


### Solution 2, use Pandas DataFrame `fillna` function

In [67]:
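def replace_bad_data_2(df_):
  # mask() returns a new frame with negative values replaced by NaN
  df = df_.mask(df_ < 0)
  # mean() skips NaN, so each column is filled with the mean of its valid values
  return df.fillna(df.mean())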

%timeit replace_bad_data_2(df_)
df = replace_bad_data_2(df_)
print(tabulate(df, headers='keys', tablefmt='psql'))

100 loops, best of 3: 3.17 ms per loop
+----+-----+-----+-----+
|    |   0 |   1 |   2 |
|----+-----+-----+-----|
|  0 |   1 |   2 |   3 |
|  1 |   1 |   3 |   1 |
|  2 |   3 |   3 |   2 |
|  3 |   2 |   4 |   2 |
+----+-----+-----+-----+


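On this small example, the version built on `pd.DataFrame`'s native `fillna` takes about 36% less time (3.17 ms vs. 4.98 ms per loop) than the function coded from scratch, which makes it roughly 1.6 times faster.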
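The timings above come from a 4x3 frame, where fixed per-call overhead is likely to dominate. To see how the two approaches compare on more data, they can be re-measured outside the notebook with the standard `timeit` module. The sketch below is only an illustration: the names `loop_version`, `fillna_version`, and `big`, along with the frame size and value range, are assumptions, and no timing result is claimed here.

```python
import timeit

import numpy as np
import pandas as pd

def loop_version(frame):
  # column-by-column replacement, in the spirit of Solution 1
  out = frame.astype(float)
  for col in out.columns:
    bad = out[col] < 0
    out.loc[bad, col] = out.loc[~bad, col].mean()
  return out

def fillna_version(frame):
  # mask negatives as NaN, then fill with the per-column mean (Solution 2)
  valid = frame.mask(frame < 0)
  return valid.fillna(valid.mean())

# Hypothetical test data: 100,000 rows of integers in [-5, 100)
rng = np.random.default_rng(0)
big = pd.DataFrame(rng.integers(-5, 100, size=(100_000, 3)))

# Both versions should produce the same cleaned frame
pd.testing.assert_frame_equal(loop_version(big), fillna_version(big))

for fn in (loop_version, fillna_version):
  seconds = timeit.timeit(lambda: fn(big), number=5) / 5
  print(f"{fn.__name__}: {seconds * 1e3:.2f} ms per call")
```

Checking that both versions return the same frame before timing them guards against comparing the speed of two functions that do different things.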