# Data Cleaning and Preparation
* In this chapter I discuss tools for missing data, duplicate data, string manipulation,
and some other analytical data transformations.

## 1. Handling Missing Data
* One of the goals
of pandas is to make working with missing data as painless as possible. For example,
all of the descriptive statistics on pandas objects exclude missing data by default.
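
A minimal sketch of that default, using the same four values as the `float_data` Series below. The variable names are only illustrative; `skipna` is the standard pandas parameter that controls this behavior.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.2, -3.5, np.nan, 0.0])

# Reductions skip NaN by default (skipna=True).
mean_default = s.mean()          # averages the three non-missing values
count_non_missing = s.count()    # counts only the non-missing entries

# Passing skipna=False lets a missing value propagate into the result.
mean_strict = s.mean(skipna=False)

print(mean_default, count_non_missing, mean_strict)
```

If a NaN result is preferable to a silently reduced sample, `skipna=False` is one way to surface the gap.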

In [1]:
import pandas as pd
import numpy as np

In [2]:
float_data = pd.Series([1.2,-3.5,np.nan, 0])
float_data

0    1.2
1   -3.5
2    NaN
3    0.0
dtype: float64

In [3]:
# isna returns a Boolean Series that is True wherever a value is missing
float_data.isna()

0    False
1    False
2     True
3    False
dtype: bool

* When cleaning up data for analysis, it is often important to do analysis on the missing data itself to identify data collection problems or potential biases in the data caused by missing data.
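
One way to start that kind of analysis is to count missing values per column and pull out the affected rows. The sketch below assumes a small, invented `survey` table; the column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data, invented for illustration.
survey = pd.DataFrame({
    "age": [34, np.nan, 29, 41, np.nan],
    "income": [52000, 61000, np.nan, 48000, 75000],
    "city": ["Izmir", None, "Manisa", "Izmir", "Izmir"],
})

missing_count = survey.isna().sum()     # number of missing values per column
missing_share = survey.isna().mean()    # fraction of missing values per column

# Rows that have at least one missing value in any column.
rows_with_gaps = survey[survey.isna().any(axis="columns")]

print(missing_count)
print(rows_with_gaps)
```

A column with a much higher share of missing values than the others is often a hint of a collection problem rather than random noise.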

In [4]:
string_data = pd.Series(["aaa",np.nan,None,"bbbb"])
string_data

0     aaa
1     NaN
2    None
3    bbbb
dtype: object

In [5]:
string_data.isna()

0    False
1     True
2     True
3    False
dtype: bool

- *"dropna:"* Filter axis labels based on whether the values for each label have missing data, with varying thresholds for how much missing data to tolerate.
- *"fillna:"* Fill in missing data with some value or with an interpolation method such as "ffill" or "bfill".
- *"isna:"* Return Boolean values indicating which values are missing/NA.
- *"notna:"* Negation of isna; returns True for non-NA values and False for NA values.

In [9]:
data = pd.Series([1,np.nan,3.5,np.nan,7])
data

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

In [10]:
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

In [11]:
data[data.notna()]

0    1.0
2    3.5
4    7.0
dtype: float64

In [13]:
data = pd.DataFrame([[1., 6.5, 3],[1., np.nan, np.nan],[np.nan, np.nan, np.nan],[np.nan, 6.5, 3.]])
data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [14]:
data.dropna()

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


In [15]:
#Passing how="all" will drop only rows that are all NA:
data.dropna(how="all")

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


In [16]:
data[4] = np.nan
data

Unnamed: 0,0,1,2,4
0,1.0,6.5,3.0,
1,1.0,,,
2,,,,
3,,6.5,3.0,


In [17]:
data.dropna(axis="columns", how="all")

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [21]:
df = pd.DataFrame(np.random.standard_normal((7,3)))
df


Unnamed: 0,0,1,2
0,2.234189,0.159232,-0.913715
1,0.617198,1.566647,-1.017059
2,0.494648,-0.498296,0.065772
3,-0.615406,-0.803713,0.076902
4,-1.523691,1.977457,1.315066
5,1.191546,-0.293786,-1.041479
6,0.819765,-0.444211,-1.523716


In [23]:
df.iloc[:4,1] = np.nan
df.iloc[:2,2] = np.nan
df

Unnamed: 0,0,1,2
0,2.234189,,
1,0.617198,,
2,0.494648,,0.065772
3,-0.615406,,0.076902
4,-1.523691,1.977457,1.315066
5,1.191546,-0.293786,-1.041479
6,0.819765,-0.444211,-1.523716


In [24]:
df.dropna()

Unnamed: 0,0,1,2
4,-1.523691,1.977457,1.315066
5,1.191546,-0.293786,-1.041479
6,0.819765,-0.444211,-1.523716


In [28]:
df.dropna(thresh=2)

Unnamed: 0,0,1,2
2,0.494648,,0.065772
3,-0.615406,,0.076902
4,-1.523691,1.977457,1.315066
5,1.191546,-0.293786,-1.041479
6,0.819765,-0.444211,-1.523716
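
The `thresh` argument is easiest to read as "keep rows with at least this many non-missing values". A small sketch with a hand-built frame (values invented for illustration) that checks this reading against an explicit count:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, np.nan],
    "b": [np.nan, np.nan, 6.0, np.nan],
    "c": [np.nan, 8.0, 9.0, np.nan],
})

# Number of non-missing values in each row.
non_missing_per_row = frame.notna().sum(axis="columns")

# Keep rows that have at least two non-missing values.
at_least_two = frame.dropna(thresh=2)

# The same selection written as an explicit Boolean filter.
same_rows = frame[non_missing_per_row >= 2]

print(at_least_two)
```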


### Filling Missing Data
* Rather than filtering out missing data (and potentially discarding other data along
with it), you may want to fill in the “holes” in any number of ways. 

In [29]:
df.fillna(0)

Unnamed: 0,0,1,2
0,2.234189,0.0,0.0
1,0.617198,0.0,0.0
2,0.494648,0.0,0.065772
3,-0.615406,0.0,0.076902
4,-1.523691,1.977457,1.315066
5,1.191546,-0.293786,-1.041479
6,0.819765,-0.444211,-1.523716


In [31]:
df.fillna({1:0.5, 2:0})

Unnamed: 0,0,1,2
0,2.234189,0.5,0.0
1,0.617198,0.5,0.0
2,0.494648,0.5,0.065772
3,-0.615406,0.5,0.076902
4,-1.523691,1.977457,1.315066
5,1.191546,-0.293786,-1.041479
6,0.819765,-0.444211,-1.523716


In [32]:
df = pd.DataFrame(np.random.standard_normal((6, 3)))
df

Unnamed: 0,0,1,2
0,-0.484238,0.39788,0.501397
1,0.362401,-0.629825,1.605448
2,-1.217255,0.113623,1.687456
3,1.347515,-1.845861,1.163533
4,0.37266,-0.374423,0.010114
5,1.470687,0.499861,0.72986


In [33]:
df.iloc[2:, 1] = np.nan
df.iloc[4:, 2] = np.nan
df

Unnamed: 0,0,1,2
0,-0.484238,0.39788,0.501397
1,0.362401,-0.629825,1.605448
2,-1.217255,,1.687456
3,1.347515,,1.163533
4,0.37266,,
5,1.470687,,


In [34]:
# fillna(method="ffill") is deprecated in recent pandas; ffill() is the equivalent call
df.ffill()


Unnamed: 0,0,1,2
0,-0.484238,0.39788,0.501397
1,0.362401,-0.629825,1.605448
2,-1.217255,-0.629825,1.687456
3,1.347515,-0.629825,1.163533
4,0.37266,-0.629825,1.163533
5,1.470687,-0.629825,1.163533


In [35]:
# limit caps how many consecutive missing values are filled
df.ffill(limit=2)


Unnamed: 0,0,1,2
0,-0.484238,0.39788,0.501397
1,0.362401,-0.629825,1.605448
2,-1.217255,-0.629825,1.687456
3,1.347515,-0.629825,1.163533
4,0.37266,,1.163533
5,1.470687,,1.163533
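
To make the direction and the `limit` argument easier to see, here is a small deterministic sketch on a hand-made Series (values chosen only for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

forward = s.ffill()                 # carry the last seen value forward
forward_capped = s.ffill(limit=1)   # fill at most one consecutive gap
backward = s.bfill()                # pull the next valid value backward

print(pd.DataFrame({"ffill": forward,
                    "ffill_limit_1": forward_capped,
                    "bfill": backward}))
```

With `limit=1`, only the first missing value after `1.0` is filled and the remaining two stay missing.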


In [36]:
data = pd.DataFrame([1., np.nan, 3.5, np.nan, 7])
data

Unnamed: 0,0
0,1.0
1,
2,3.5
3,
4,7.0


In [38]:
data.fillna(data.mean())

Unnamed: 0,0
0,1.0
1,3.833333
2,3.5
3,3.833333
4,7.0
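
The filled value 3.833333 is the mean of the three non-missing entries, (1 + 3.5 + 7) / 3. With several columns, passing the result of `mean()` fills each column with its own mean. A sketch, assuming a second invented column `y`:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "x": [1.0, np.nan, 3.5, np.nan, 7.0],
    "y": [10.0, 20.0, np.nan, 40.0, 50.0],
})

# mean() skips NaN, so each entry is the mean of that column's observed values.
col_means = frame.mean()

# fillna aligns the Series of means with the column labels.
filled = frame.fillna(col_means)

print(filled)
```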


## 2. Data Transformation
* Filtering, cleaning, and other transformations are another class of important operations.

### Removing Duplicates

In [40]:
data = pd.DataFrame({"k1": ["one", "two"] * 3 + ["two"],"k2": [1, 1, 2, 3, 3, 4, 4]})
data

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4
6,two,4


In [42]:
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

In [43]:
data.drop_duplicates()

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4


In [44]:
data["v1"]=range(7)
data

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
5,two,4,5
6,two,4,6


In [45]:
data.drop_duplicates(subset=["k1"])

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1


In [46]:
data.drop_duplicates(subset=["k1","k2"],keep="last")

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
6,two,4,6
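
The `keep` argument decides which member of each duplicate group survives. The sketch below rebuilds the same frame and compares the three options; `keep=False` is a standard pandas value that drops every member of a duplicated group.

```python
import pandas as pd

data = pd.DataFrame({
    "k1": ["one", "two"] * 3 + ["two"],
    "k2": [1, 1, 2, 3, 3, 4, 4],
    "v1": range(7),
})

# Rows 5 and 6 share the same (k1, k2) pair.
first = data.drop_duplicates(subset=["k1", "k2"])                # default keep="first"
last = data.drop_duplicates(subset=["k1", "k2"], keep="last")
neither = data.drop_duplicates(subset=["k1", "k2"], keep=False)

print(list(first.index), list(last.index), list(neither.index))
```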


### Transforming Data Using a Function or Mapping
* You may wish to perform some transformation based on the
values in an array, Series, or column in a DataFrame.

In [50]:
data = pd.DataFrame({"food": ["bacon", "pulled pork", "bacon",
"pastrami", "corned beef", "bacon",
"pastrami", "honey ham", "nova lox"],
"ounces": [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data

Unnamed: 0,food,ounces
0,bacon,4.0
1,pulled pork,3.0
2,bacon,12.0
3,pastrami,6.0
4,corned beef,7.5
5,bacon,8.0
6,pastrami,3.0
7,honey ham,5.0
8,nova lox,6.0


In [52]:
meat_to_animal = {
"bacon": "pig",
"pulled pork": "pig",
"pastrami": "cow",
"corned beef": "cow",
"honey ham": "pig",
"nova lox": "salmon"}
data["animal"] = data["food"].map(meat_to_animal)
data

Unnamed: 0,food,ounces,animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,pastrami,6.0,cow
4,corned beef,7.5,cow
5,bacon,8.0,pig
6,pastrami,3.0,cow
7,honey ham,5.0,pig
8,nova lox,6.0,salmon


In [53]:
def get_animal(x):
    return meat_to_animal[x]
data["food"].map(get_animal)

0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object
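
The dict and function forms differ when a value has no entry in the mapping: a dict passed to `map` yields NaN, while a function that indexes the dict directly (like `get_animal` above) would raise a `KeyError`. A sketch with a shortened mapping and an invented unmapped value, "tofu":

```python
import pandas as pd

meat_to_animal = {"bacon": "pig", "pastrami": "cow", "nova lox": "salmon"}
foods = pd.Series(["bacon", "Pastrami", "tofu"])   # "tofu" is not in the mapping

# Normalize capitalization first so "Pastrami" matches the lowercase key.
normalized = foods.str.lower()

# Dict mapping: unmatched values become NaN instead of raising.
by_dict = normalized.map(meat_to_animal)

# Function mapping: dict.get supplies an explicit fallback label.
by_func = normalized.map(lambda food: meat_to_animal.get(food, "unknown"))

print(by_dict)
print(by_func)
```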

### Replacing Values

In [54]:
data = pd.Series([1., -999., 2., -999., -1000., 3.])
data

0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

In [55]:
data.replace(-999,np.nan)

0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

In [56]:
data.replace([-999,-1000],np.nan)

0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

In [57]:
data.replace({-999:np.nan, -1000:0})


0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64
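
`replace` also accepts two lists of equal length, pairing each value with its substitute by position. A short sketch showing that this matches the dict form used above:

```python
import numpy as np
import pandas as pd

data = pd.Series([1.0, -999.0, 2.0, -999.0, -1000.0, 3.0])

# Two equal-length lists pair up position by position.
paired = data.replace([-999, -1000], [np.nan, 0])

# The dict form expresses the same substitution.
by_dict = data.replace({-999: np.nan, -1000: 0})

print(paired)
```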

### Renaming Axis Indexes

In [58]:
data = pd.DataFrame(np.arange(12).reshape((3,4)), index=["Izmir","Aydın","Manisa"],columns=["one","two","three","four"])
data

Unnamed: 0,one,two,three,four
Izmir,0,1,2,3
Aydın,4,5,6,7
Manisa,8,9,10,11


In [63]:
def transforms(x):
    return x[:4].upper()
data.index = data.index.map(transforms)
data

Unnamed: 0,one,two,three,four
IZMI,0,1,2,3
AYDI,4,5,6,7
MANI,8,9,10,11


In [65]:
data.rename(index=str.title, columns=str.upper)

Unnamed: 0,ONE,TWO,THREE,FOUR
Izmi,0,1,2,3
Aydi,4,5,6,7
Mani,8,9,10,11


In [67]:
data.rename(index={"MANI":"Usak"},columns={"three":"peak"})

Unnamed: 0,one,two,peak,four
IZMI,0,1,2,3
AYDI,4,5,6,7
Usak,8,9,10,11
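
Note that `rename` returns a new object and leaves the original labels alone, unlike the direct assignment to `data.index` earlier. A sketch that rebuilds the frame with the already shortened labels and checks both objects:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(12).reshape((3, 4)),
    index=["IZMI", "AYDI", "MANI"],
    columns=["one", "two", "three", "four"],
)

renamed = data.rename(index={"MANI": "Usak"}, columns={"three": "peak"})

# The original labels are untouched; assign the result back to keep the change.
print(list(data.index), list(renamed.index))
print(list(data.columns), list(renamed.columns))
```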


### Discretization and Binning

In [69]:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
age_categories = pd.cut(ages,bins)
age_categories

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64, right]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

In [70]:
age_categories.codes

array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)

In [71]:
age_categories.categories

IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]], dtype='interval[int64, right]')

In [72]:
# pd.value_counts is deprecated in recent pandas; call value_counts on a Series instead
pd.Series(age_categories).value_counts()


(18, 25]     5
(25, 35]     3
(35, 60]     3
(60, 100]    1
Name: count, dtype: int64

In [73]:
pd.cut(ages, bins, right=False)

[[18, 25), [18, 25), [25, 35), [25, 35), [18, 25), ..., [25, 35), [60, 100), [35, 60), [35, 60), [25, 35)]
Length: 12
Categories (4, interval[int64, left]): [[18, 25) < [25, 35) < [35, 60) < [60, 100)]

In [74]:
group_names = ["Youth", "YoungAdult", "MiddleAged", "Senior"]
pd.cut(ages,bins,labels=group_names)

['Youth', 'Youth', 'Youth', 'YoungAdult', 'Youth', ..., 'YoungAdult', 'Senior', 'MiddleAged', 'MiddleAged', 'YoungAdult']
Length: 12
Categories (4, object): ['Youth' < 'YoungAdult' < 'MiddleAged' < 'Senior']

In [76]:
data = np.random.uniform(size=20)
pd.cut(data, 4, precision=2)

[(0.041, 0.25], (0.45, 0.66], (0.25, 0.45], (0.66, 0.86], (0.041, 0.25], ..., (0.45, 0.66], (0.25, 0.45], (0.041, 0.25], (0.041, 0.25], (0.66, 0.86]]
Length: 20
Categories (4, interval[float64, right]): [(0.041, 0.25] < (0.25, 0.45] < (0.45, 0.66] < (0.66, 0.86]]
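
Values that fall outside every bin are not an error; they come back as missing, which shows up as the code -1. Because the default intervals are open on the left, a value equal to the lowest edge is also excluded unless `include_lowest=True` is passed. A sketch with a few hand-picked edge cases:

```python
import pandas as pd

ages = [17, 18, 25, 26, 100, 101]   # chosen to sit on or just outside the bin edges
bins = [18, 25, 35, 60, 100]

default = pd.cut(ages, bins)                            # intervals are (left, right]
with_lowest = pd.cut(ages, bins, include_lowest=True)   # first interval also contains 18

print(default.codes)       # -1 marks values that fall outside every bin
print(with_lowest.codes)
```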

### Detecting and Filtering Outliers
* Filtering or transforming outliers is largely a matter of applying array operations.

In [77]:
data = pd.DataFrame(np.random.standard_normal((1000,4)))
data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,-0.036285,-0.050902,0.005943,-0.004743
std,0.991905,0.978518,1.004511,0.994474
min,-3.398059,-3.356962,-3.609879,-3.714284
25%,-0.691826,-0.705213,-0.656201,-0.661465
50%,-0.030767,-0.075738,0.015536,-0.016905
75%,0.626603,0.584604,0.663625,0.658866
max,3.071252,3.218251,3.605091,3.101858


In [78]:
col = data[2]
col[col.abs() > 3]

917    3.605091
939    3.034499
975   -3.609879
981   -3.218031
Name: 2, dtype: float64

* To select all rows containing a value greater than 3 or less than -3, you can use the any method on a
Boolean DataFrame:

In [79]:
data[(data.abs()>3).any(axis="columns")]

Unnamed: 0,0,1,2,3
98,1.705871,3.218251,1.210938,0.775489
203,1.067833,-0.005422,0.422113,-3.714284
313,-3.398059,-0.805517,1.045631,1.293094
377,-0.043119,-3.356962,1.306071,-1.673376
447,-3.314516,1.181149,-0.043703,-1.742505
473,3.071252,0.199586,-0.833224,-0.149562
885,0.924779,-1.647201,-2.361723,-3.633535
891,-0.418962,0.618443,-1.074626,3.101858
917,-1.047589,0.058031,3.605091,0.815314
939,-1.138941,-0.403203,3.034499,0.179553


In [81]:
data[data.abs() > 3] = np.sign(data) * 3
data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,-0.035644,-0.050763,0.006131,-0.003497
std,0.98942,0.976669,0.999719,0.989626
min,-3.0,-3.0,-3.0,-3.0
25%,-0.691826,-0.705213,-0.656201,-0.661465
50%,-0.030767,-0.075738,0.015536,-0.016905
75%,0.626603,0.584604,0.663625,0.658866
max,3.0,3.0,3.0,3.0
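
The same capping can be written with `clip`, which may read more directly than the sign trick. The sketch below uses a seeded generator (the seed is arbitrary) and checks that both approaches give the same values:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)   # seed chosen arbitrarily for reproducibility
data = pd.DataFrame(rng.standard_normal((1000, 4)))

# Sign trick, applied to a copy so the original stays intact.
capped = data.copy()
capped[capped.abs() > 3] = np.sign(capped) * 3

# clip bounds every value to the interval [-3, 3] in one call.
clipped = data.clip(lower=-3, upper=3)

print(np.allclose(capped.to_numpy(), clipped.to_numpy()))
```

`clip` returns a new DataFrame, so the original data is still available for comparison.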


### Permutation and Random Sampling
* Calling `np.random.permutation` with the length of the axis you want to permute produces an array of integers indicating the new
ordering.

In [83]:
df = pd.DataFrame(np.arange(5*7).reshape((5,7)))
df

Unnamed: 0,0,1,2,3,4,5,6
0,0,1,2,3,4,5,6
1,7,8,9,10,11,12,13
2,14,15,16,17,18,19,20
3,21,22,23,24,25,26,27
4,28,29,30,31,32,33,34


In [84]:
sampler = np.random.permutation(5)
sampler

array([3, 4, 1, 2, 0])

In [85]:
df.take(sampler)

Unnamed: 0,0,1,2,3,4,5,6
3,21,22,23,24,25,26,27
4,28,29,30,31,32,33,34
1,7,8,9,10,11,12,13
2,14,15,16,17,18,19,20
0,0,1,2,3,4,5,6


In [87]:
df.iloc[sampler]

Unnamed: 0,0,1,2,3,4,5,6
3,21,22,23,24,25,26,27
4,28,29,30,31,32,33,34
1,7,8,9,10,11,12,13
2,14,15,16,17,18,19,20
0,0,1,2,3,4,5,6


In [88]:
df.take(sampler, axis="columns")

Unnamed: 0,3,4,1,2,0
0,3,4,1,2,0
1,10,11,8,9,7
2,17,18,15,16,14
3,24,25,22,23,21
4,31,32,29,30,28


### Computing Indicator/Dummy Variables
* Another type of transformation for statistical modeling or machine learning applications
is converting a categorical variable into a dummy or indicator matrix.

In [89]:
df = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "b"],
                   "data1":range(6)})
df
                  

Unnamed: 0,key,data1
0,b,0
1,b,1
2,a,2
3,c,3
4,a,4
5,b,5


In [90]:
pd.get_dummies(df["key"])

Unnamed: 0,a,b,c
0,False,True,False
1,False,True,False
2,True,False,False
3,False,False,True
4,True,False,False
5,False,True,False


In [91]:
dummies = pd.get_dummies(df["key"],prefix="key")
df_with_dummy = df[["data1"]].join(dummies)
df_with_dummy

Unnamed: 0,data1,key_a,key_b,key_c
0,0,False,True,False
1,1,False,True,False
2,2,True,False,False
3,3,False,False,True
4,4,True,False,False
5,5,False,True,False


In [95]:
values = np.random.uniform(size=10)
values

array([0.63091118, 0.23897839, 0.35630991, 0.2483937 , 0.97649671,
       0.98108817, 0.65938207, 0.52320338, 0.88939248, 0.60208899])

In [97]:
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]
pd.get_dummies(pd.cut(values, bins))

Unnamed: 0,"(0.0, 0.2]","(0.2, 0.4]","(0.4, 0.6]","(0.6, 0.8]","(0.8, 1.0]"
0,False,False,False,True,False
1,False,True,False,False,False
2,False,True,False,False,False
3,False,True,False,False,False
4,False,False,False,False,True
5,False,False,False,False,True
6,False,False,False,True,False
7,False,False,True,False,False
8,False,False,False,False,True
9,False,False,False,True,False
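
Two variations that may be useful in practice: `dtype=int` produces 0/1 columns instead of booleans, and `Series.str.get_dummies` handles rows that belong to several categories at once. The `genres` values below are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "b"], "data1": range(6)})

# dtype=int yields 0/1 columns rather than booleans.
as_int = pd.get_dummies(df["key"], prefix="key", dtype=int)

# A row can carry several labels; str.get_dummies splits on a separator.
genres = pd.Series(["Comedy|Drama", "Drama", "Action|Comedy"])  # invented example values
multi = genres.str.get_dummies(sep="|")

print(as_int)
print(multi)
```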


### Computations with Categoricals
*  Some parts of pandas, like the groupby
function, perform better when working with categoricals. 

In [99]:
rng = np.random.default_rng(seed=12345)
draws = rng.standard_normal(1000)

In [101]:
bins = pd.qcut(draws,4)
bins

[(-3.121, -0.675], (0.687, 3.211], (-3.121, -0.675], (-0.675, 0.0134], (-0.675, 0.0134], ..., (0.0134, 0.687], (0.0134, 0.687], (-0.675, 0.0134], (0.0134, 0.687], (-0.675, 0.0134]]
Length: 1000
Categories (4, interval[float64, right]): [(-3.121, -0.675] < (-0.675, 0.0134] < (0.0134, 0.687] < (0.687, 3.211]]

In [105]:
bins = pd.qcut(draws, 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
bins = pd.Series(bins, name='quartile')
# observed=False keeps the current groupby behavior explicit and avoids the FutureWarning
results = (pd.Series(draws)
           .groupby(bins, observed=False)
           .agg(['count', 'min', 'max'])
           .reset_index())
results

Unnamed: 0,quartile,count,min,max
0,Q1,250,-3.119609,-0.678494
1,Q2,250,-0.673305,0.008009
2,Q3,250,0.018753,0.686183
3,Q4,250,0.688282,3.211418
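
`qcut` also accepts explicit quantiles (numbers between 0 and 1) in place of a bin count, which gives bins of unequal size. A sketch using the same seed as above; the particular cut points are just an example:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=12345)
draws = rng.standard_normal(1000)

# Cut points are cumulative quantiles between 0 and 1.
uneven_bins = pd.qcut(draws, [0, 0.1, 0.5, 0.9, 1.0])

# Count the members of each bin, ordered by interval.
counts = pd.Series(uneven_bins).value_counts().sort_index()

print(counts)
```

With 1,000 draws, the bottom and top bins each hold 10 percent of the values and the two middle bins hold 40 percent each.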


In [106]:
s = pd.Series(['a', 'b', 'c', 'd'] * 2)
cat_s = s.astype('category')
cat_s

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']

In [107]:
cat_s.cat.codes

0    0
1    1
2    2
3    3
4    0
5    1
6    2
7    3
dtype: int8

In [108]:
cat_s.cat.categories

Index(['a', 'b', 'c', 'd'], dtype='object')

In [109]:
cat_s.value_counts()

a    2
b    2
c    2
d    2
Name: count, dtype: int64
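
Part of the benefit of categoricals comes from storing each distinct value once and keeping only small integer codes per row. The sketch below compares memory use on a repetitive, invented column of labels and rebuilds the original values from the codes. Exact byte counts depend on the pandas version and string storage, so only the comparison is meaningful.

```python
import pandas as pd

# 100,000 rows drawn from four distinct labels (invented for illustration).
labels = pd.Series(["apple", "orange", "banana", "kiwi"] * 25_000)
as_category = labels.astype("category")

plain_bytes = labels.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(plain_bytes, category_bytes)

# The values can be rebuilt by looking each code up in the categories.
codes = as_category.cat.codes.to_numpy()
restored = pd.Series(as_category.cat.categories.take(codes))
print(restored.head())
```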