In [1]:
import numpy as np
import pandas as pd

# <p style="color:#85200c;">7.1 Handling Missing Data</p>

For data with float64 dtype, pandas uses the floating-point value **NaN** (Not a Number) to represent missing data:

In [2]:
float_data = pd.Series([1.2, -3.5, np.nan, 0])
float_data

0    1.2
1   -3.5
2    NaN
3    0.0
dtype: float64

Use the *isna* method (or its alias *isnull*) to get a Boolean Series with True where values are
**null**:

In [3]:
float_data.isna()

0    False
1    False
2     True
3    False
dtype: bool
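
As a small aside that goes beyond the float64 example above, Python's built-in `None` is also treated as missing by *isna*. The sketch below uses an illustrative Series named `string_data`, which is not part of the original notebook:

```python
import numpy as np
import pandas as pd

# Illustrative data (an assumption, not from the notebook):
# strings mixed with both np.nan and None
string_data = pd.Series(["aardvark", np.nan, None, "avocado"])

# isna flags np.nan and None alike
mask = string_data.isna()
print(mask.tolist())
```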

### Filtering Out Missing Data

Use *dropna* to return a Series with the **NaN** values dropped. This is equivalent to Boolean indexing with *notna*:

In [4]:
float_data.dropna()

0    1.2
1   -3.5
3    0.0
dtype: float64

In [5]:
float_data[float_data.notna()]

0    1.2
1   -3.5
3    0.0
dtype: float64
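
To confirm that the two approaches really give the same result, one option is to compare them with `Series.equals`. This is a minimal check that rebuilds `float_data` so it runs on its own; the names `dropped`, `filtered`, and `same` are illustrative:

```python
import numpy as np
import pandas as pd

float_data = pd.Series([1.2, -3.5, np.nan, 0])

dropped = float_data.dropna()                 # method form
filtered = float_data[float_data.notna()]     # Boolean-indexing form

# equals compares values, index, and dtype
same = dropped.equals(filtered)
print(same)
```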

*dropna* also works on DataFrame objects. By default (**axis="index"**) it drops any row that contains a missing value:

In [6]:
data = pd.DataFrame(np.random.normal(size=12).reshape(4, 3))
data.iloc[3] = data.iloc[2, [0, 2]] = data.iloc[0, 0] = np.nan
data

Unnamed: 0,0,1,2
0,,-0.349712,-2.219474
1,-0.713211,0.290986,0.712127
2,,0.92551,
3,,,


In [7]:
data.dropna()

Unnamed: 0,0,1,2
1,-0.713211,0.290986,0.712127


In [8]:
data.dropna(axis="columns") # every column contains a NaN, so no columns remain

0
1
2
3


Passing **how="all"** drops only the rows (or columns) whose values are all **NaN**:

In [9]:
data.dropna(how="all")

Unnamed: 0,0,1,2
0,,-0.349712,-2.219474
1,-0.713211,0.290986,0.712127
2,,0.92551,


You can also set a threshold with the **thresh** parameter, which is the minimum number of non-NaN values a row must have to be kept:

In [10]:
data.dropna(thresh=2) # keeps only the rows with at least 2 non-NaN values

Unnamed: 0,0,1,2
0,,-0.349712,-2.219474
1,-0.713211,0.290986,0.712127
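
Because **thresh** counts non-missing values rather than missing ones, it can help to compare it against an explicit count. The sketch below uses a small hand-built frame (the name `frame` and its values are illustrative, not the random data above) so the result is reproducible:

```python
import numpy as np
import pandas as pd

# Illustrative frame: rows have 3, 2, 1, and 0 non-NaN values respectively
frame = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan],
    "b": [2.0, 5.0, np.nan, np.nan],
    "c": [3.0, 6.0, 9.0, np.nan],
})

# Number of non-NaN values in each row
non_na_counts = frame.notna().sum(axis=1)

kept = frame.dropna(thresh=2)          # rows with at least 2 non-NaN values
manual = frame[non_na_counts >= 2]     # the same filter written by hand

print(non_na_counts.tolist())
print(kept.index.tolist())
```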


### Filling In Missing Data

Calling *fillna* with a constant replaces missing values with that value:

In [12]:
data.fillna(0)

Unnamed: 0,0,1,2
0,0.0,-0.349712,-2.219474
1,-0.713211,0.290986,0.712127
2,0.0,0.92551,0.0
3,0.0,0.0,0.0


If you call *fillna* with a dictionary, you can use a different fill value for each
column:

In [13]:
data.fillna({0: 0, 1: 1})

Unnamed: 0,0,1,2
0,0.0,-0.349712,-2.219474
1,-0.713211,0.290986,0.712127
2,0.0,0.92551,
3,0.0,1.0,


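You can also fill from neighboring values by passing **ffill** (forward fill) or **bfill** (backward fill) to the *method* parameter. Note that in pandas 2.1 and later this parameter is deprecated in favor of calling the *ffill* and *bfill* methods directly: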

In [14]:
data.fillna(method='ffill') # forward fill

Unnamed: 0,0,1,2
0,,-0.349712,-2.219474
1,-0.713211,0.290986,0.712127
2,-0.713211,0.92551,0.712127
3,-0.713211,0.92551,0.712127
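
The forward fill above propagates a value across every consecutive gap. A hedged sketch of capping that with `limit`, using the standalone `ffill` and `bfill` methods on an illustrative Series `s` (not part of the original notebook):

```python
import numpy as np
import pandas as pd

# Illustrative Series with a run of three consecutive missing values
s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# Fill at most one consecutive NaN in each direction
forward = s.ffill(limit=1)    # only position 1 is filled, from position 0
backward = s.bfill(limit=1)   # only position 3 is filled, from position 4

print(forward.tolist())
print(backward.tolist())
```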


With fillna you can do lots of other things such as simple data imputation
using the **median** or **mean** statistics:

In [15]:
data.fillna(data.mean())

Unnamed: 0,0,1,2
0,-0.713211,-0.349712,-2.219474
1,-0.713211,0.290986,0.712127
2,-0.713211,0.92551,-0.753674
3,-0.713211,0.288928,-0.753674
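
The median variant mentioned above works the same way. As a rough illustration on a hand-built frame (the name `frame` and its values are assumptions chosen so that one outlier makes the mean and median differ):

```python
import numpy as np
import pandas as pd

# Illustrative frame: column "x" contains an outlier (100.0)
frame = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 100.0],
    "y": [np.nan, 4.0, 6.0, 8.0],
})

# mean() and median() skip NaN by default and return one value per column,
# so fillna uses a different fill value for each column
by_mean = frame.fillna(frame.mean())
by_median = frame.fillna(frame.median())

print(by_mean.loc[2, "x"])     # mean of 1, 2, 100
print(by_median.loc[2, "x"])   # median of 1, 2, 100
```

The median is less sensitive to the outlier, which is one reason it is sometimes preferred for skewed columns.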


See <a href="#table7.1">Table 7-1</a> for a reference on *fillna* function arguments.

<table id="table7.1" style="float:left;">
    <caption style="font-style:italic"> Table 7-1. fillna function arguments </caption>
    <tr><th style="text-align:left;">Argument</th><th style="text-align:left;padding-left: 50px">Description</th></tr>
    <tr><td style="text-align:left;">value</td><td style="text-align:left;padding-left: 50px">Scalar value or dictionary-like object to use to fill missing
values</td></tr>
    <tr><td style="text-align:left;">method</td><td style="text-align:left;padding-left: 50px">Interpolation method: one of "bfill" (backward fill) or "ffill" (forward fill);
default is None</td></tr>
    <tr><td style="text-align:left;">axis</td><td style="text-align:left;padding-left: 50px">Axis to fill on ("index" or "columns"); default is axis="index"</td></tr>
    <tr><td style="text-align:left;">limit</td><td style="text-align:left;padding-left: 50px">For forward and backward filling, maximum number of
consecutive periods to fill</td></tr>

</table>

# <p style="color:#85200c;">7.2 Data Transformation</p>

### Removing Duplicates

Start with the `duplicated` method, which returns a Boolean Series indicating whether each row duplicates an earlier row:

In [17]:
data = pd.DataFrame(
    {
    "k1": ["one", "two"] * 3 + ["two"],
    "k2": [1, 1, 2, 3, 3, 4, 4]
})

data

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4
6,two,4


In [18]:
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

To drop them, use the `drop_duplicates` method:

In [20]:
data.drop_duplicates()

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4


If you want to check for duplicates based on a subset of columns, list them in the `subset` parameter, and use `keep` to choose which occurrence to retain (`"first"` by default, `"last"`, or `False` to drop every duplicated row). Consider:

In [22]:
data.drop_duplicates(subset=["k1"], keep="last")

Unnamed: 0,k1,k2
4,one,3
6,two,4
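
For comparison with `keep="last"`, here is a short sketch of the other two `keep` options on the same data. The frame is rebuilt so the block runs on its own; the names `first_kept` and `none_kept` are illustrative:

```python
import pandas as pd

data = pd.DataFrame({
    "k1": ["one", "two"] * 3 + ["two"],
    "k2": [1, 1, 2, 3, 3, 4, 4],
})

# Default keep="first": retain the first row seen for each k1 value
first_kept = data.drop_duplicates(subset=["k1"])

# keep=False: drop every row that has a duplicate, including the first one
none_kept = data.drop_duplicates(keep=False)

print(first_kept.index.tolist())
print(none_kept.index.tolist())
```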


### Transforming Data Using a Function or Mapping

There are some useful methods for element-wise operations. Consider the `map` method, which applies a function to each element of a Series:

In [33]:
data = pd.DataFrame({
    "food": ["bacon", "pulled pork", "bacon", "pastrami", "corned beef", "bacon", "pastrami", "honey ham", "nova lox"],
    "ounces": [4, 3, 12, 6, 7.5, 8, 3, 5, 6]
    })

data["ounces"].map(lambda x: x + 6)

0    10.0
1     9.0
2    18.0
3    12.0
4    13.5
5    14.0
6     9.0
7    11.0
8    12.0
Name: ounces, dtype: float64

`map` also accepts a dictionary that maps each value to its replacement:


In [34]:
meat_to_animal = {
"bacon": "pig",
"pulled pork": "pig",
"pastrami": "cow",
"corned beef": "cow",
"honey ham": "pig",
"nova lox": "salmon"
}

data["food"].map(meat_to_animal)

0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object

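You can also use `applymap` to apply an element-wise operation to a DataFrame (in pandas 2.1 and later it is deprecated in favor of `DataFrame.map`), and `apply` to apply a function along an axis, column by column by default.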
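
A minimal sketch of these DataFrame-level operations, on an illustrative frame that is not part of the notebook. Because the element-wise method is named `applymap` in older pandas and `DataFrame.map` from 2.1 onward, the block picks whichever one is available; that lookup is a convenience assumption, not something the text prescribes:

```python
import pandas as pd

# Illustrative frame
frame = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 40.0]})

# apply: the function receives one column at a time (default axis)
col_range = frame.apply(lambda col: col.max() - col.min())

# apply with axis="columns": the function receives one row at a time
row_sum = frame.apply(lambda row: row.sum(), axis="columns")

# Element-wise: use DataFrame.map when it exists, else fall back to applymap
elementwise = getattr(frame, "map", None) or frame.applymap
formatted = elementwise(lambda x: f"{x:.1f}")

print(col_range.tolist())
print(row_sum.tolist())
print(formatted.loc[0, "a"])
```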

### Replacing Values

The `replace` method has a simple signature: pass the value to replace, followed by its replacement:

In [35]:
data.replace("bacon", "pig")

Unnamed: 0,food,ounces
0,pig,4.0
1,pulled pork,3.0
2,pig,12.0
3,pastrami,6.0
4,corned beef,7.5
5,pig,8.0
6,pastrami,3.0
7,honey ham,5.0
8,nova lox,6.0


Or replace a list of values with a single new value:

In [36]:
data.replace(["bacon", "pulled pork"], "pig")

Unnamed: 0,food,ounces
0,pig,4.0
1,pig,3.0
2,pig,12.0
3,pastrami,6.0
4,corned beef,7.5
5,pig,8.0
6,pastrami,3.0
7,honey ham,5.0
8,nova lox,6.0


Or replace list-to-list, where each value is replaced by the item at the same position in the second list:

In [37]:
data.replace(["bacon", "pastrami"], ["pig", "cow"])

Unnamed: 0,food,ounces
0,pig,4.0
1,pulled pork,3.0
2,pig,12.0
3,cow,6.0
4,corned beef,7.5
5,pig,8.0
6,cow,3.0
7,honey ham,5.0
8,nova lox,6.0


You can even pass a dictionary of old-to-new values, which works the same way as list-to-list replacement.
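
The dictionary form might look like the sketch below. It uses a shortened version of the food data so the block runs on its own; the name `replaced` is illustrative:

```python
import pandas as pd

# Shortened version of the food data above
data = pd.DataFrame({
    "food": ["bacon", "pulled pork", "pastrami", "nova lox"],
    "ounces": [4, 3, 6, 6],
})

# Keys are the values to replace, values are their replacements
replaced = data.replace({"bacon": "pig", "pastrami": "cow"})

print(replaced["food"].tolist())
```

`replace` returns a new DataFrame, so `data` itself is left unchanged.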

### Renaming Axis Indexes

Use the `rename` method and pass a function or dictionary to the `index` or `columns` parameters. This works like the `map` method on a Series, applied to the axis labels:

In [38]:
data = pd.DataFrame(
    np.arange(12).reshape((3, 4)),
    index=["Ohio", "Colorado", "New York"],
    columns=["one", "two", "three", "four"]
    )

data.rename(index=str.upper, columns=str.upper)

Unnamed: 0,ONE,TWO,THREE,FOUR
OHIO,0,1,2,3
COLORADO,4,5,6,7
NEW YORK,8,9,10,11
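
The dictionary form renames only the labels it lists and leaves the rest alone. A small sketch on the same frame, where the new labels `"INDIANA"` and `"peekaboo"` are arbitrary illustrative choices:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(12).reshape((3, 4)),
    index=["Ohio", "Colorado", "New York"],
    columns=["one", "two", "three", "four"],
)

# Only the labels present as dictionary keys are renamed
renamed = data.rename(index={"Ohio": "INDIANA"}, columns={"three": "peekaboo"})

print(renamed.index.tolist())
print(renamed.columns.tolist())
```

As with the function form, `rename` returns a new object and the original `data` keeps its labels.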


### Discretization and Binning

Continuous data is often discretized or otherwise separated into *bins* for
analysis. <br>Suppose you have data about a group of people in a study, and you
want to group them into discrete age buckets. You can categorize them by age with `pd.cut`:

In [40]:
ages = np.array([20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32])
bins = [18, 25, 35, 60, 100]
group_names = ["Youth", "YoungAdult", "MiddleAged", "Senior"]

pd.cut(ages, bins=bins, labels=group_names)

['Youth', 'Youth', 'Youth', 'YoungAdult', 'Youth', ..., 'YoungAdult', 'Senior', 'MiddleAged', 'MiddleAged', 'YoungAdult']
Length: 12
Categories (4, object): ['Youth' < 'YoungAdult' < 'MiddleAged' < 'Senior']

`bins` can also be an integer giving the number of equal-width bins to compute from the data. The `right` parameter specifies which side of each interval is closed. Consider:

In [49]:
data = np.random.randint(1, 100, size=100)
pd.cut(data, bins=4, right=False) # will create left closed intervals

[[2.0, 26.25), [50.5, 74.75), [26.25, 50.5), [74.75, 99.097), [74.75, 99.097), ..., [26.25, 50.5), [26.25, 50.5), [26.25, 50.5), [2.0, 26.25), [74.75, 99.097)]
Length: 100
Categories (4, interval[float64, left]): [[2.0, 26.25) < [26.25, 50.5) < [50.5, 74.75) < [74.75, 99.097)]
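
The effect of `right` is easiest to see on values that sit exactly on a bin edge. The sketch below reuses the age bins and labels from earlier with an illustrative array `edge_ages` made up of edge values only:

```python
import numpy as np
import pandas as pd

# Illustrative values that fall exactly on the bin edges
edge_ages = np.array([18, 25, 35, 60])
bins = [18, 25, 35, 60, 100]
group_names = ["Youth", "YoungAdult", "MiddleAged", "Senior"]

# Default right=True: intervals are (18, 25], (25, 35], ...
# so 18 falls outside every bin and becomes NaN
right_closed = pd.cut(edge_ages, bins=bins, labels=group_names)

# right=False: intervals are [18, 25), [25, 35), ...
# so each edge value belongs to the bin that starts at it
left_closed = pd.cut(edge_ages, bins=bins, labels=group_names, right=False)

print(right_closed.tolist())
print(left_closed.tolist())
```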

pandas `qcut` bins the data based on sample quantiles, so you obtain roughly equal-sized bins. Compare its counts with those of `cut`, whose equal-width bins can hold very different numbers of values:

In [63]:
pd.qcut(ages, 4).value_counts()

(19.999, 22.75]    3
(22.75, 29.0]      3
(29.0, 38.0]       3
(38.0, 61.0]       3
Name: count, dtype: int64

In [64]:
pd.cut(ages, 4).value_counts()

(19.959, 30.25]    6
(30.25, 40.5]      3
(40.5, 50.75]      2
(50.75, 61.0]      1
Name: count, dtype: int64

Similar to pandas.cut, you can pass your own quantiles (numbers between
0 and 1, inclusive):

In [69]:
pd.qcut(ages, [0, 0.1, 0.5, 0.9, 1.]).value_counts() # bin edges at the 0%, 10%, 50%, 90%, and 100% quantiles

(19.999, 21.1]    2
(21.1, 29.0]      4
(29.0, 44.6]      4
(44.6, 61.0]      2
Name: count, dtype: int64
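
Like `cut`, `qcut` accepts a `labels` argument. A brief sketch that names the quartiles of the same `ages` array, where the labels `"Q1"` to `"Q4"` are illustrative choices:

```python
import numpy as np
import pandas as pd

ages = np.array([20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32])

# One label per quantile bin, lowest to highest
quartiles = pd.qcut(ages, 4, labels=["Q1", "Q2", "Q3", "Q4"])

# Count how many ages fall in each labeled quartile
counts = pd.Series(quartiles).value_counts()

print(quartiles.tolist())
print(counts.to_dict())
```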