# Cleaning Data

> [Main Table of Contents](../../README.md)

In [6]:
import pandas as pd
import numpy as np

## In This Notebook
- Dx Dirty Data
	- str Type
- Merge multiple datasets
- Reshape Data

## Dx Dirty Data

Investigate | Use pandas functions/methods
--- | ---
Data Types | df.dtypes (inspect), df.astype() (convert)
Data Ranges | df.min(), df.max()
Handle duplicates | duplicated(), drop_duplicates()
Handle NA, NaN, Null, 0, ... values | df.isna(), df.dropna(), df.fillna()<br>Quick note: df.fillna() is a more specific df.replace()<br>df.sort_values('columnName')<br>Visualize missing values with `missingno.matrix(df)`
Membership check | df.isin()
Make sure 'category' types are within range<br>Collapse many categories into fewer | pd.cut(), df.replace()<br>When there are too many values to collapse by hand, use string comparison libraries like `thefuzz`
Cross field validation for sanity checks | df.sum(axis=1)
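
The checks in the table above can be chained into a short diagnostic pass. The following is a minimal sketch on a hypothetical toy frame (the column names, values, and the `valid_plans` list are assumptions for illustration, not data from this notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'age' stored as strings, one duplicate row, one missing value
raw = pd.DataFrame({
    "name": ["Ann", "Bob", "Bob", "Cy"],
    "age": ["34", "41", "41", np.nan],
    "plan": ["basic", "pro", "pro", "gold"],
})

print(raw.dtypes)  # data types: 'age' is not numeric yet

# Duplicates: count fully duplicated rows, then drop them
n_dupes = raw.duplicated().sum()
clean = raw.drop_duplicates().copy()

# Missing values: count them, convert the dtype, then fill with the median
n_missing = clean["age"].isna().sum()
clean["age"] = clean["age"].astype(float)
clean["age"] = clean["age"].fillna(clean["age"].median())

# Data ranges: a quick look at the extremes
print(clean["age"].min(), clean["age"].max())

# Membership check: rows whose 'plan' is outside the allowed set
valid_plans = ["basic", "pro"]  # assumed list of valid categories
bad_plan = clean[~clean["plan"].isin(valid_plans)]
print(bad_plan)
```

Filling with the median is only one option; whether to drop, fill, or flag missing values depends on the dataset.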

### str Type
- Use `series.str` methods

Method | Description
--- | ---
series.str.split() | Powerful kwarg 'expand'<br>When `expand=True` turns list into dataframe which can be used to create new columns or replace multiple columns at once
series.str.cat() | Useful in combining columns
series.str.get() | Extract elements<br>Alternate to direct indexing lists, tuples

In [7]:
# Example series.str.split(pattern, expand)
s = pd.Series([
        "Ellie-Bellie",
        "Oscar-Do",
        "Toby-Robi",
        pd.NA])
df = pd.DataFrame()
df[['first', 'last']] = s.str.split('-', expand=True)
df

Unnamed: 0,first,last
0,Ellie,Bellie
1,Oscar,Do
2,Toby,Robi
3,,


In [8]:
# Example series.str.cat(others, sep)
df['Full Name'] = df['first'].str.cat(df['last'], sep='___')
df

Unnamed: 0,first,last,Full Name
0,Ellie,Bellie,Ellie___Bellie
1,Oscar,Do,Oscar___Do
2,Toby,Robi,Toby___Robi
3,,,


In [9]:
# Example series.str.get()
# Without expand, str.split() returns a list per row; str.get(i) extracts element i
df['Just First Name'] = s.str.split('-').str.get(0)
df['Just Last Name'] = s.str.split('-').str.get(1)
df

Unnamed: 0,first,last,Full Name,Just First Name,Just Last Name
0,Ellie,Bellie,Ellie___Bellie,Ellie,Bellie
1,Oscar,Do,Oscar___Do,Oscar,Do
2,Toby,Robi,Toby___Robi,Toby,Robi
3,,,,,


## Merge multiple datasets
- pd.concat() (df.append() was removed in pandas 2.0; use pd.concat() instead)
- For complex merges use the `recordlinkage` library to combine datasets whose values are formatted differently
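
As a rough illustration of combining datasets, the sketch below stacks two frames row-wise with `pd.concat()` and then does a key-based join with `DataFrame.merge()`. The frames and column names are hypothetical, and `merge()` is brought in here as an assumption about what a simple merge would look like; it is not listed in the notes above.

```python
import pandas as pd

# Hypothetical frames with the same columns
q1 = pd.DataFrame({"id": [1, 2], "sales": [10, 20]})
q2 = pd.DataFrame({"id": [3, 4], "sales": [30, 40]})

# Stack rows; ignore_index=True renumbers the result 0..n-1
stacked = pd.concat([q1, q2], ignore_index=True)

# Key-based join against a lookup table; how="left" keeps every row of 'stacked'
regions = pd.DataFrame({"id": [1, 2, 3], "region": ["N", "S", "E"]})
joined = stacked.merge(regions, on="id", how="left")
print(joined)
```

Rows without a match in the lookup table (here `id` 4) get NaN in the joined column, which is worth checking with `isna()` after any left join.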

## Reshape Data
- WARNING: *Reshape data on unindexed dataframes to avoid losing data*
- When dealing with a MultiIndex, refer to a level by number (the outermost level is 0) or by name
- After applying aggregate functions remember to unstack levels
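
The two MultiIndex notes above can be sketched as follows, assuming a hypothetical `sales` frame: a two-key `groupby` produces a MultiIndex, and `unstack()` accepts either the level number or the level name.

```python
import pandas as pd

# Hypothetical data for illustration
sales = pd.DataFrame({
    "store": ["a", "a", "b", "b"],
    "year": [2020, 2021, 2020, 2021],
    "revenue": [1.0, 2.0, 3.0, 4.0],
})

# Aggregating on two keys gives a Series with a MultiIndex:
# 'store' is level 0 (outermost), 'year' is level 1 (innermost)
totals = sales.groupby(["store", "year"])["revenue"].sum()

by_name = totals.unstack(level="year")  # refer to the level by name
by_number = totals.unstack(level=1)     # same level, by position
outer = totals.unstack(level=0)         # move the outermost level instead
print(by_name)
```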

pd.df reshaping methods | Description
--- | ---
df.pivot() | A type of long to wide reshaping
df.pivot_table() | A type of long to wide reshaping<br>Use when need to apply summary statistics<br>Use when pivoting multi-index df<br>Use when have some duplicate row values<br>Useful kwarg: `margins=True`<br>default `aggfunc='mean'`<br>Handle NA values with `fill_value` and `dropna` kwargs
df.melt() | A type of wide to long reshaping<br>Unpivot a df<br>Use to collapse columns into two columns: a variable column (rename with `var_name`) and a value column (rename with `value_name`)<br>`id_vars` are the fixed columns, the ones not being collapsed<br>`value_vars` are the columns to stack
pd.wide_to_long(df, stubnames, i, j, sep='', suffix='\\d+') | This is a pd function not method<br>Unpivot a df<br>Similar functionality to melt<br>Use when have multiple similar column names and those names can be stripped by regex to a prefix and suffix<br>May be useful after using `pd.json_normalize` which produces similar column names separated by given separator.
df.stack() | A type of wide to long reshaping of index levels (both column-axis and index-axis levels)<br>Unpivot a level(s) of column-axis to index-axis level(s)<br>If column-axis has multiple levels, specify which column-axis level to move to the index<br>By default, the innermost column-axis level will convert to the innermost index-axis level<br>By default `dropna=True`, which drops rows whose values are all NA<br>To keep ALL data, use `dropna=False` to keep NA values and chain `.fillna(<someValue>)`
df.unstack() | A type of long to wide reshaping of index levels (both column-axis and index-axis levels)<br>Pivot a level(s) of MultiIndex to new level(s) of column-axis<br>By default, the innermost index will convert to innermost column-axis level
df.swaplevel(i=-2, j=-1, axis=0) | Swap levels of MultiIndex dfs 
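
To make the index-level methods in the table more concrete, here is a small sketch of `stack()`, `swaplevel()`, and `unstack()` on a hypothetical frame (the names `wide`, `student`, and `subject` are made up for illustration). It contains no missing values, so the `dropna` behaviour does not come into play.

```python
import pandas as pd

# Hypothetical wide frame: one row per student, one column per subject
wide = pd.DataFrame(
    {"art": [70, 60], "math": [90, 80]},
    index=pd.Index(["ann", "bob"], name="student"),
)
wide.columns.name = "subject"

# stack(): the column level becomes the innermost index level
long = wide.stack()

# swaplevel(): exchange the two innermost index levels (defaults i=-2, j=-1)
swapped = long.swaplevel()

# unstack(): the innermost index level goes back to the columns
back = long.unstack()
print(long)
```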


In [20]:
# LONG df
ldf = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two',
                           'two'],
                   'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'baz': [1, 2, 3, 4, 5, 6],
                   'zoo': ['x', 'y', 'z', 'q', 'w', 't']})
print(ldf)

   foo bar  baz zoo
0  one   A    1   x
1  one   B    2   y
2  one   C    3   z
3  two   A    4   q
4  two   B    5   w
5  two   C    6   t


In [19]:
# Example df.pivot(). long to wide.

# if no values then the rest of cols are values
ldfp = ldf.pivot(index='foo', columns='zoo')  
# bar values grouped by foo and identified by zoo
ldfp = ldf.pivot(index='foo', columns='zoo', values='bar')
ldfp




zoo,q,t,w,x,y,z
foo,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
one,,,,A,B,C
two,A,C,B,,,


In [16]:
# Example df.pivot_table().  long to wide.
# The median baz grouped by bar and foo
ldfpt = ldf.pivot_table(index='bar', columns='foo', values='baz', aggfunc='median', margins=True)
ldfpt

foo,one,two,All
bar,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
A,1,4,2.5
B,2,5,3.5
C,3,6,4.5
All,2,5,3.5


In [22]:
# Example df.unstack().  long to wide.

# unstack deals with index. So create it
# Set baz to outer and zoo to innermost index
ldfi = ldf.set_index(['baz', 'zoo'])
print(ldfi)
ldfu = ldfi.unstack()
ldfu

         foo bar
baz zoo         
1   x    one   A
2   y    one   B
3   z    one   C
4   q    two   A
5   w    two   B
6   t    two   C


Unnamed: 0_level_0,foo,foo,foo,foo,foo,foo,bar,bar,bar,bar,bar,bar
zoo,q,t,w,x,y,z,q,t,w,x,y,z
baz,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2
1,,,,one,,,,,,A,,
2,,,,,one,,,,,,B,
3,,,,,,one,,,,,,C
4,two,,,,,,A,,,,,
5,,,two,,,,,,B,,,
6,,two,,,,,,C,,,,


In [30]:
# WIDE df
np.random.seed(123)
wdf = pd.DataFrame({"A1970" : {0 : "a", 1 : "b", 2 : "c"},
                   "A1980" : {0 : "d", 1 : "e", 2 : "f"},
                   "B1970" : {0 : 2.5, 1 : 1.2, 2 : .7},
                   "B1980" : {0 : 3.2, 1 : 1.3, 2 : .1},
                   "X"     : dict(zip(range(3), np.random.randn(3)))
                  })
wdf["id"] = wdf.index
# print(wdf)

df = pd.read_csv('../../data/books.csv')
df.head(3)

Unnamed: 0,bookID,title,authors,average_rating,isbn,isbn13,language_code,num_pages,ratings_count,text_reviews_count,publication_date,publisher
0,1,Harry Potter and the Half-Blood Prince (Harry ...,J.K. Rowling/Mary GrandPré,4.57,439785960,9780439785969,eng,652,2095690,27591,9/16/2006,Scholastic Inc.
1,2,Harry Potter and the Order of the Phoenix (Har...,J.K. Rowling/Mary GrandPré,4.49,439358078,9780439358071,eng,870,2153167,29221,9/1/2004,Scholastic Inc.
2,4,Harry Potter and the Chamber of Secrets (Harry...,J.K. Rowling,4.42,439554896,9780439554893,eng,352,6333,244,11/1/2003,Scholastic


In [None]:
# Example df.melt().  Wide to long.

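# id_vars stay fixed; every other column is collapsed into variable/value pairs
wdfp = wdf.melt(id_vars=['id', 'X'])
wdfp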
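
For comparison, the sketch below contrasts `melt()` with `pd.wide_to_long()` on a frame shaped like `wdf` (rebuilt here without the random `X` column so the block is self-contained; the variable names are assumptions). `wide_to_long()` splits column names such as `A1970` into the stub `A` and the suffix `1970`.

```python
import pandas as pd

# Self-contained stand-in for the wide frame above
wide = pd.DataFrame({
    "id": [0, 1, 2],
    "A1970": ["a", "b", "c"],
    "A1980": ["d", "e", "f"],
    "B1970": [2.5, 1.2, 0.7],
    "B1980": [3.2, 1.3, 0.1],
})

# melt(): every non-id column becomes a (variable, value) pair
melted = wide.melt(id_vars="id", var_name="column", value_name="value")

# wide_to_long(): stubs 'A' and 'B' stay as columns, the numeric suffix becomes 'year'
long = pd.wide_to_long(wide, stubnames=["A", "B"], i="id", j="year")
print(long)
```

`melt()` mixes the `A` and `B` values into one column, while `wide_to_long()` keeps them apart and indexes the result by `id` and `year`.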