# Data Preparation Demo
We start with this __demo csv__ file:

In [8]:
import pandas as pd
df = pd.read_csv('prepDemo.csv') # C:/Users/Stefan/Desktop/notebooks/

In [7]:
df

Unnamed: 0,Name,Value
0,Bill,53.0
1,Jenny,35.0
2,Tom,42.0
3,Deana,37.0
4,John,22.0
5,Ben,
6,Susie,38.0


The output shows one missing value (NaN) in the Value column.
For a large dataset, a Boolean mask of the missing entries is more practical:

In [9]:
df.isnull()

Unnamed: 0,Name,Value
0,False,False
1,False,False
2,False,False
3,False,False
4,False,False
5,False,True
6,False,False


We now want to count the number of occurrences of NaN:

In [10]:
df['Value'].isnull().sum() # .notnull() would be the opposite

1
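The same counting idea extends to the whole frame. The following is a minimal sketch, assuming a small hand-built frame that mirrors the demo data (the name `demo` and the derived variables are illustrative, not part of the notebook):

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for the demo data (assumption: same columns as the CSV)
demo = pd.DataFrame({
    "Name": ["Bill", "Jenny", "Tom", "Deana", "John", "Ben", "Susie"],
    "Value": [53.0, 35.0, 42.0, 37.0, 22.0, np.nan, 38.0],
})

missing_per_column = demo.isnull().sum()              # Series: one count per column
total_missing = int(demo.isnull().sum().sum())        # count over the whole frame
present_values = int(demo["Value"].notnull().sum())   # the opposite view

print(missing_per_column)
print(total_missing, present_values)
```

Summing works because True counts as 1 and False as 0.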

## Handling Missing Data
Luckily, there are a number of handy functions to manage missing data:

In [11]:
df2 = df.copy() # save the frame (a plain df2 = df would only create a second name for the same object)
df2.dropna()

Unnamed: 0,Name,Value
0,Bill,53.0
1,Jenny,35.0
2,Tom,42.0
3,Deana,37.0
4,John,22.0
6,Susie,38.0


We got rid of Ben :-) Use .dropna(how='all') if you want to drop only those rows where all values are NA.
And use .dropna(axis=1, how='all') to drop all empty columns.
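The difference between these variants is easier to see on a frame that actually contains an empty row and an empty column. A small sketch, assuming an illustrative frame named `frame` (not part of the demo data):

```python
import numpy as np
import pandas as pd

# Illustrative frame: row 1 is entirely empty, column "C" is entirely empty
frame = pd.DataFrame({
    "A": [1.0, np.nan, 3.0],
    "B": [4.0, np.nan, np.nan],
    "C": [np.nan, np.nan, np.nan],
})

any_na_rows_dropped = frame.dropna()                  # drops every row with at least one NA
empty_rows_dropped = frame.dropna(how="all")          # drops only row 1
empty_cols_dropped = frame.dropna(axis=1, how="all")  # drops only column "C"

print(len(any_na_rows_dropped))
print(empty_rows_dropped)
print(empty_cols_dropped)
```

Because column "C" is empty, the default call removes every row here, which is why the `how` and `axis` arguments matter.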

Alternatively, we can fill the missing value with a substitute (e.g. the rounded average), if the analysis allows it:

In [12]:
df2.fillna(38)

Unnamed: 0,Name,Value
0,Bill,53.0
1,Jenny,35.0
2,Tom,42.0
3,Deana,37.0
4,John,22.0
5,Ben,38.0
6,Susie,38.0


# Filling Missing Data
Deleting NA values is often not the right solution. This is where .fillna, a flexible and powerful method, helps. In its simplest form you give it a constant value:

In [13]:
# df.fillna(42)

Or fill each column with a different value by using the column as a key:

In [None]:
# df.fillna({1: 42, 2: 137, 3: 5}) # a dict: the keys are column labels
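As a runnable illustration of the per-column form, here is a sketch that assumes a made-up frame `scores` with two named columns (both the frame and its column names are hypothetical):

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({
    "Value": [53.0, np.nan, 42.0],
    "Weight": [np.nan, 1.5, 2.0],
})

# Each key is a column label, each value is the filler for that column
filled = scores.fillna({"Value": 42, "Weight": 0.0})

print(filled)
```

Columns that do not appear in the dict are left as they are.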

__Many pandas methods accept inplace=True to modify the object itself instead of returning a changed copy (the call then returns None). Do not forget: if you leave it out, you get a new object and the old object is still unchanged!__
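The effect of inplace can be checked directly. A minimal sketch, assuming a one-column frame named `original` (an illustrative name):

```python
import numpy as np
import pandas as pd

original = pd.DataFrame({"Value": [22.0, np.nan, 38.0]})

new_frame = original.fillna(0)                    # returns a new frame
still_missing = int(original["Value"].isnull().sum())  # original keeps its NaN

result = original.fillna(0, inplace=True)         # modifies original, returns None
now_missing = int(original["Value"].isnull().sum())

print(still_missing, result, now_missing)
```

Assigning the return value of an inplace call to a variable is a common slip, since that variable ends up holding None.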

Of course, to write values by hand you can use the index. Let's say we have:

In [14]:
df2

Unnamed: 0,Name,Value
0,Bill,53.0
1,Jenny,35.0
2,Tom,42.0
3,Deana,37.0
4,John,22.0
5,Ben,
6,Susie,38.0


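You can either set the value by hand with _iloc_. Be aware that df2.iloc[5] addresses the entire row, so the assignment below overwrites the Name as well as the Value; df2.loc[5, 'Value'] = 38 would change only the one cell.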

In [17]:
df2.iloc[5] = 38
df2

Unnamed: 0,Name,Value
0,Bill,53.0
1,Jenny,35.0
2,Tom,42.0
3,Deana,37.0
4,John,22.0
5,38,38.0
6,Susie,38.0
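In the output above the Name in row 5 was overwritten along with the Value. To touch a single cell, give both a row and a column. A sketch, assuming a shortened stand-in frame called `people` (illustrative, not the notebook's df2):

```python
import numpy as np
import pandas as pd

people = pd.DataFrame({
    "Name": ["John", "Ben", "Susie"],
    "Value": [22.0, np.nan, 38.0],
})

# Label-based: row label 1, column "Value"
people.loc[1, "Value"] = 38

# Position-based equivalent: row position 1, column position of "Value"
value_pos = people.columns.get_loc("Value")
people.iloc[1, value_pos] = 38

print(people)
```

Both forms leave the Name column alone.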


Or automatically by using the many options of .fillna():

In [None]:
# df2.fillna(method='ffill') # forward fill: cell 5 gets the value 22 from the row above (newer pandas: df2.ffill())
# df2.fillna(df2.mean(numeric_only=True)) # fill with the column mean, which is computed without the NaN entries

Have a look here at the method parameters: [link](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html)
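Both automatic strategies can be tried on a small frame. This is a sketch under the assumption of a shortened stand-in frame named `values`; it uses `.ffill()`, the method form of forward filling, and `numeric_only=True` so that the text column is ignored when averaging:

```python
import numpy as np
import pandas as pd

values = pd.DataFrame({
    "Name": ["John", "Ben", "Susie"],
    "Value": [22.0, np.nan, 38.0],
})

# Forward fill: Ben gets 22.0, the last valid value above
forward_filled = values.ffill()

# Mean of the numeric columns only; NaN is skipped, so (22 + 38) / 2 = 30
column_means = values.mean(numeric_only=True)
mean_filled = values.fillna(column_means)

print(forward_filled)
print(mean_filled)
```

Which strategy is appropriate depends on the data: forward filling assumes the row order is meaningful, while the mean does not.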

# Other Operations

__DeDuplication__
* .duplicated() # Boolean Series: True for each row that repeats an earlier one
* .drop_duplicates() # keep the first occurrence, drop the later repeats
* .drop_duplicates(['mycolumn']) # compare only the column mycolumn
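The three calls above can be compared on a frame with one exact repeat. A sketch, assuming an illustrative frame `rows` with made-up values:

```python
import pandas as pd

rows = pd.DataFrame({
    "Name": ["Bill", "Jenny", "Bill", "Bill"],
    "Value": [53, 35, 53, 99],
})

is_repeat = rows.duplicated()                  # True only for row 2 (full repeat of row 0)
unique_rows = rows.drop_duplicates()           # keeps rows 0, 1, 3
unique_names = rows.drop_duplicates(["Name"])  # compares "Name" only: keeps rows 0, 1

print(is_repeat.tolist())
print(unique_rows)
print(unique_names)
```

Row 3 survives the full-row check because its Value differs, but not the check restricted to Name.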

__Applying a Function__

* map(...) # Where ... can be a function, dict, or series [link](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.map.html)

Hence with a dict you can replace values by looking them up in the dictionary, and with a function or lambda you can apply any transformation to each value:

In [21]:
df2

Unnamed: 0,Name,Value
0,Bill,53.0
1,Jenny,35.0
2,Tom,42.0
3,Deana,37.0
4,John,22.0
5,38,38.0
6,Susie,38.0


In [22]:
df2['Value'].map(lambda x: x*2) # Do not forget: df2 is unchanged! Assign if needed!

0    106.0
1     70.0
2     84.0
3     74.0
4     44.0
5     76.0
6     76.0
Name: Value, dtype: float64
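The dict form of map works the same way as the lambda shown above. A short sketch, assuming a hypothetical lookup table `team_by_name` (the teams are invented for illustration):

```python
import pandas as pd

names = pd.Series(["Bill", "Jenny", "Tom"])

# Dict: each element is looked up as a key; elements without a key become NaN
team_by_name = {"Bill": "red", "Jenny": "blue"}  # hypothetical mapping
teams = names.map(team_by_name)

# Function: applied to every element
lengths = names.map(len)

print(teams)
print(lengths)
```

Tom has no entry in the dict, so his result is NaN rather than an error.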

__Replacing Data__

* .replace(val, new_val) # replace val with new_val, which can be any other data [4]
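A few common call shapes of replace, sketched on an invented Series `readings` in which -999 and -1 are assumed to be sentinel values:

```python
import numpy as np
import pandas as pd

readings = pd.Series([53.0, -999.0, 42.0, -1.0])

single = readings.replace(-999.0, np.nan)               # one value
several = readings.replace([-999.0, -1.0], np.nan)      # several values, one replacement
mapped = readings.replace({-999.0: np.nan, -1.0: 0.0})  # dict: old value -> new value

print(single.tolist())
print(several.tolist())
print(mapped.tolist())
```

Like the other methods here, replace returns a new object and leaves `readings` unchanged.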

__Changing the Axis__

__Discretization and Grouping__

__Outlier Handling__

* col[np.abs(col) > 42] # find the values in a column whose absolute value is bigger than 42
* data[(np.abs(data) > 42).any(axis=1)] # select all rows that ...
* Creating a Random Sample
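The items in this list can be tried out as follows. This is a sketch on an invented frame `data`; the threshold 42 is taken from the bullets, and `random_state=0` is an arbitrary seed chosen only to make the draw repeatable:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "a": [1, -50, 3, 4],
    "b": [10, 20, 99, 5],
})

col = data["a"]
large_in_col = col[np.abs(col) > 42]                     # values of one column beyond the threshold
rows_with_large = data[(np.abs(data) > 42).any(axis=1)]  # rows with at least one such value

# Random sample of two rows without replacement
sample = data.sample(n=2, random_state=0)

print(large_in_col)
print(rows_with_large)
print(sample)
```

Passing the axis by keyword, `any(axis=1)`, keeps the call valid on current pandas versions.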

__Manipulating String and Regular Expressions__
