# BWT - Deep Learning Track
## Data Cleaning and Preparation
### Adil Mubashir Chaudhry

In this notebook we discuss tools for handling missing data, removing duplicate data,
manipulating strings, and performing other analytical data transformations.

### Handling Missing Data

Missing data occurs commonly in many data analysis applications. One of the goals
of pandas is to make working with missing data as painless as possible. For example,
all of the descriptive statistics on pandas objects exclude missing data by default.
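
As a small illustration of that default, the sketch below compares `mean()` with and without `skipna`. The Series `values` is made up for this example and is not part of the notebook's data.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value
values = pd.Series([1.0, np.nan, 3.0])

# Default behaviour: NaN is skipped, so the mean is taken over 1.0 and 3.0
mean_skipping_nan = values.mean()

# skipna=False propagates the missing value instead of ignoring it
mean_with_nan = values.mean(skipna=False)

print(mean_skipping_nan)
print(mean_with_nan)
```

`count()` follows the same rule and reports only the non-missing entries.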

In [4]:
import pandas as pd
import numpy as np

string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
print(string_data)

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object


In [5]:
print(string_data.isnull())

0    False
1    False
2     True
3    False
dtype: bool


In [9]:
cleaned = string_data.dropna()
cleaned

0     aardvark
1    artichoke
3      avocado
dtype: object
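
On a Series, `dropna()` gives the same result as indexing with the boolean mask from `notnull()`. The following is a minimal sketch of that equivalence; the Series `fruits` is a hypothetical stand-in, not the notebook's `string_data`.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing entry
fruits = pd.Series(['apple', np.nan, 'cherry'])

# Two ways to keep only the non-missing entries
dropped = fruits.dropna()
masked = fruits[fruits.notnull()]

# Both keep the original index labels of the surviving rows
print(dropped.equals(masked))
```

Note that the surviving rows keep their original index labels, which is why label 2 is missing from the output above.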

In [10]:
from numpy import nan as NA
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA], [NA, NA, NA], [NA, 6.5, 3.]])
data

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0


In [11]:
# By default, dropna() drops every row that contains at least one missing value
cleaned_df = data.dropna()
cleaned_df

     0    1    2
0  1.0  6.5  3.0


In [13]:
# how='all' drops only the rows in which every value is missing
cleaned_df_all = data.dropna(how='all')
cleaned_df_all

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0
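
`dropna` can also work on columns (`axis=1`) or keep rows that have a minimum number of non-missing values (`thresh`). The sketch below assumes a hypothetical `frame` that resembles the DataFrame above with one extra, entirely empty column added for illustration.

```python
import numpy as np
import pandas as pd

NA = np.nan

# Hypothetical frame: the fourth column (label 3) is entirely missing
frame = pd.DataFrame([[1., 6.5, 3., NA],
                      [1., NA, NA, NA],
                      [NA, NA, NA, NA],
                      [NA, 6.5, 3., NA]])

# axis=1 with how='all' drops columns in which every value is missing
no_empty_cols = frame.dropna(axis=1, how='all')

# thresh=2 keeps only rows with at least two non-missing values
at_least_two = frame.dropna(thresh=2)

print(no_empty_cols.columns.tolist())
print(at_least_two.index.tolist())
```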


### Filling In Missing Data

Rather than filtering out missing data (and potentially discarding other data along
with it), you may want to fill in the “holes” in any number of ways.

In [14]:
filled_data = data.fillna(0)
filled_data

     0    1    2
0  1.0  6.5  3.0
1  1.0  0.0  0.0
2  0.0  0.0  0.0
3  0.0  6.5  3.0
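
A constant is only one way to fill the holes. The sketch below shows three other common options: a different fill value per column, forward filling, and filling with each column's mean. The DataFrame `frame` and its column names `a` and `b` are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing values in both columns
frame = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                      'b': [np.nan, 5.0, np.nan]})

# A dict gives a different fill value for each column
per_column = frame.fillna({'a': 0.0, 'b': -1.0})

# Forward fill propagates the last valid value downwards;
# a leading NaN has nothing before it and stays missing
forward = frame.ffill()

# frame.mean() is a Series indexed by column, so each column
# is filled with its own mean
with_mean = frame.fillna(frame.mean())

print(per_column)
print(forward)
print(with_mean)
```

Which strategy is appropriate depends on the data: filling with a constant or a mean changes the distribution of a column, while forward filling mostly makes sense for ordered data such as time series.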


### Removing Duplicates

In [16]:
data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'], 'k2': [1, 1, 2, 3, 3, 4, 4]})
data

    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
6  two   4


In [17]:
# True for each row that repeats an earlier row in full
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

In [19]:
# Returns a new DataFrame without the rows flagged by duplicated()
data.drop_duplicates()

    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
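
By default both methods compare whole rows and keep the first occurrence. The sketch below rebuilds the same small DataFrame under the hypothetical name `pairs` and shows the `subset` and `keep` parameters.

```python
import pandas as pd

# Same values as the DataFrame above, rebuilt so the example is self-contained
pairs = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                      'k2': [1, 1, 2, 3, 3, 4, 4]})

# subset restricts the comparison to the listed columns:
# only the first row for each distinct k1 value is kept
first_per_key = pairs.drop_duplicates(subset=['k1'])

# keep='last' keeps the last occurrence of each duplicated row
last_of_each = pairs.drop_duplicates(keep='last')

print(first_per_key.index.tolist())
print(last_of_each.index.tolist())
```

With `keep='last'`, row 5 is dropped and row 6 is kept, which is the reverse of the default behaviour shown above.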


### Detecting and Filtering Outliers

In [20]:
data = pd.DataFrame(np.random.randn(1000,4))
data.describe()

                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean     -0.006353     0.029269     0.039295     0.000672
std       1.008809     1.064094     0.975869     1.003045
min      -3.559713    -2.960120    -3.079701    -2.885767
25%      -0.714073    -0.721393    -0.622616    -0.689482
50%      -0.012113    -0.036600     0.063705     0.029277
75%       0.693272     0.723633     0.690712     0.693805
max       2.890150     3.781627     2.877248     2.970899


In [21]:
# Values in column 2 whose absolute value exceeds 3
col = data[2]
col[np.abs(col) > 3]

838   -3.079701
Name: 2, dtype: float64
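
The same test can be applied to every column at once to select all rows that contain at least one value beyond the threshold. This is a minimal sketch using a freshly generated, seeded sample named `samples`, so its values differ from the unseeded data shown above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the seed only makes this example reproducible
rng = np.random.default_rng(0)
samples = pd.DataFrame(rng.standard_normal((1000, 4)))

# Boolean frame marking values whose absolute value exceeds 3
is_outlier = samples.abs() > 3

# any(axis=1) is True for rows with at least one such value
outlier_rows = samples[is_outlier.any(axis=1)]

print(outlier_rows)
```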

In [25]:
# Cap values outside [-3, 3]: np.sign(data) is -1 or 1, so outliers become -3 or 3
data[np.abs(data) > 3] = np.sign(data) * 3
data.describe()

                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean     -0.005388     0.027409     0.039375     0.000672
std       1.005700     1.058418     0.975617     1.003045
min      -3.000000    -2.960120    -3.000000    -2.885767
25%      -0.714073    -0.721393    -0.622616    -0.689482
50%      -0.012113    -0.036600     0.063705     0.029277
75%       0.693272     0.723633     0.690712     0.693805
max       2.890150     3.000000     2.877248     2.970899
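
pandas also provides `DataFrame.clip`, which should give the same capped result without modifying the original object. The sketch below compares the two approaches on a seeded sample named `samples`; this is an illustrative alternative, not the method used in the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the seed only makes this example reproducible
rng = np.random.default_rng(0)
samples = pd.DataFrame(rng.standard_normal((1000, 4)))

# Approach from the cell above, applied to a copy so samples stays intact
capped_by_assignment = samples.copy()
capped_by_assignment[capped_by_assignment.abs() > 3] = np.sign(capped_by_assignment) * 3

# clip returns a new DataFrame limited to the interval [-3, 3]
capped_by_clip = samples.clip(lower=-3, upper=3)

print(np.allclose(capped_by_assignment, capped_by_clip))
print(capped_by_clip.describe())
```

Because `clip` returns a new object, the uncapped values remain available in `samples` for later comparison.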
