# Unit 5 Lecture 2 - Wrangling Data

ESI4628: Decision Support Systems for Industrial Engineers<br>
University of Central Florida
Dr. Ivan Garibay, Ramya Akula, Mostafa Saeidi, Madeline Schiappa, and Brett Belcher. 
https://github.com/igaribay/DSSwithPython/blob/master/DSS-Week05/Notebook/DSS-Unit05-Lecture02.2018.ipynb

## Notebook Learning Objectives
After studying this notebook students should be able to:
- index data hierarchically and reshape data using stack and unstack
- aggregate, group, filter and transform data
- merge datasets using inner, outer, left and right join operations
- create Pivot tables
- remove NaN from data and remove duplicates

# Overview

When large amounts of data are spread across various files, accessing that data is a lot easier when it is organized properly. In this chapter we will learn how to wrangle and aggregate the data as needed.

In [10]:
import pandas as pd
import numpy as np

# Hierarchical Indexing

Dealing with higher-dimensional data is always a challenge. Hierarchical indexing places two or more index levels on a single axis, which lets us work with higher-dimensional data in a lower-dimensional form. Let's begin with a simple example.

In [12]:
pd.read_excel('../Data/DSS-Unit05-File01.xlsx', sheet_name = 'Sales') #importing data from an Excel file

Unnamed: 0,Apple,Amazon,Alphabet
2018-06-30,53427,52886,32758
2018-03-31,61224,51042,31393
2017-12-31,88477,60453,32521
2017-09-30,52574,43744,27963
2017-06-30,45260,37955,26007


To create a hierarchical index, we simply define the index of the dataframe as two or more lists as follows:

In [30]:
data = pd.DataFrame(np.random.randn(27).reshape(9,3), index = [['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd' ], [1, 2, 3, 1, 3, 1, 2, 2, 3]])
data

Unnamed: 0,Unnamed: 1,0,1,2
a,1,-0.561031,0.182514,0.934103
a,2,1.174967,2.440814,0.094836
a,3,0.265001,0.347549,-0.750255
b,1,-0.555708,2.614888,-0.636853
b,3,1.272148,0.54357,1.761683
c,1,1.355795,1.143298,1.112891
c,2,0.659473,-2.144723,1.581474
d,2,0.119811,-0.632336,-0.261156
d,3,0.93805,-0.490501,-0.37692


A hierarchical index can also be used for the columns:

In [31]:
data = pd.DataFrame(np.random.randn(27).reshape(9,3), index = [['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd' ], [1, 2, 3, 1, 3, 1, 2, 2, 3]], columns = [['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']])
data

Unnamed: 0_level_0,Unnamed: 1_level_0,Ohio,Ohio,Colorado
Unnamed: 0_level_1,Unnamed: 1_level_1,Green,Red,Green
a,1,-0.252847,-0.646841,-1.760724
a,2,0.289878,-0.193996,-1.244343
a,3,0.637832,-0.624122,-0.381844
b,1,-2.055031,0.230039,0.854829
b,3,0.424692,1.336254,1.890008
c,1,-0.334377,-2.065203,-1.381171
c,2,-0.388404,-0.526469,-0.922241
d,2,1.117908,2.556997,-0.265275
d,3,0.359539,0.275689,0.621303
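As a minimal sketch of why a hierarchical index is convenient, the block below rebuilds a frame of the same shape and selects from it by level. The names `demo`, `key1`, `key2`, `state` and `color` are hypothetical, and a fixed seed is assumed so the values are reproducible; they are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical data with the same layout as the frame above (fixed seed).
rng = np.random.RandomState(0)
demo = pd.DataFrame(
    rng.randn(9, 3),
    index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
           [1, 2, 3, 1, 3, 1, 2, 2, 3]],
    columns=[['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
)

# Naming the levels (names chosen for illustration) makes later code easier to read.
demo.index.names = ['key1', 'key2']
demo.columns.names = ['state', 'color']

# Partial indexing: give only the outer label.
rows_a = demo.loc['a']   # all rows whose outer row label is 'a'
ohio = demo['Ohio']      # all columns whose outer column label is 'Ohio'

# Full indexing: give one label per level as a tuple.
value = demo.loc[('b', 3), ('Colorado', 'Green')]

print(rows_a.shape, ohio.shape)  # (3, 3) (9, 2)
```

Selecting by the outer label drops that level from the result, so `rows_a` is indexed only by the inner row labels 1, 2, 3.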


# Reshaping Data Frames

We use the <code>.stack()</code> and <code>.unstack()</code> methods to reshape a DataFrame by moving index levels between the columns and the rows. Let's load the following Excel file, which contains quarterly financial data for three big tech firms, to illustrate.
**Note:** I am using <code>.read_excel()</code> to read the Excel file. I am providing the _Excel file location_, the _sheet name inside the file_, and <code>header = [0,1]</code> to indicate that the _first and second rows are headers_, not data.

In [36]:
findata = pd.read_excel('../Data/DSS-Unit05-File01.xlsx', sheet_name = 'Financials_v1', header = [0,1]) #importing data from an Excel file
findata

Unnamed: 0_level_0,Sales/Revenue,Sales/Revenue,Sales/Revenue,Gross Income/Profit,Gross Income/Profit,Gross Income/Profit,Total Assets,Total Assets,Total Assets,Total Liabilities,Total Liabilities,Total Liabilities
Unnamed: 0_level_1,Apple,Amazon,Alphabet,Apple,Amazon,Alphabet,Apple,Amazon,Alphabet,Apple,Amazon,Alphabet
2018-06-30,53427,52886,32758,20789,22254,18875,375319,134100,197295,241272,99105,44793
2018-03-31,61224,51042,31393,23530,20307,17926,321686,126362,167497,193437,94899,28461
2017-12-31,88477,60453,32521,34069,21959,18254,290479,131310,147461,171124,103601,27130
2017-09-30,52574,43744,27963,19902,16195,16815,231839,115267,129187,120292,90609,25327
2017-06-30,45260,37955,26007,17267,14504,15634,207000,87781,110920,83451,64567,23611


Now let's use <code>.stack()</code> and see what happens:

In [47]:
stacked_findata = findata.stack()
stacked_findata

Unnamed: 0,Unnamed: 1,Gross Income/Profit,Sales/Revenue,Total Assets,Total Liabilities
2018-06-30,Alphabet,18875,32758,197295,44793
2018-06-30,Amazon,22254,52886,134100,99105
2018-06-30,Apple,20789,53427,375319,241272
2018-03-31,Alphabet,17926,31393,167497,28461
2018-03-31,Amazon,20307,51042,126362,94899
2018-03-31,Apple,23530,61224,321686,193437
2017-12-31,Alphabet,18254,32521,147461,27130
2017-12-31,Amazon,21959,60453,131310,103601
2017-12-31,Apple,34069,88477,290479,171124
2017-09-30,Alphabet,16815,27963,129187,25327


The innermost column index (Apple, Amazon, Alphabet) was moved from the columns into the rows. Now let's try to reverse this operation with <code>.unstack()</code>:

In [49]:
stacked_findata.unstack() # operation is reversible

Unnamed: 0_level_0,Gross Income/Profit,Gross Income/Profit,Gross Income/Profit,Sales/Revenue,Sales/Revenue,Sales/Revenue,Total Assets,Total Assets,Total Assets,Total Liabilities,Total Liabilities,Total Liabilities
Unnamed: 0_level_1,Alphabet,Amazon,Apple,Alphabet,Amazon,Apple,Alphabet,Amazon,Apple,Alphabet,Amazon,Apple
2018-06-30,18875,22254,20789,32758,52886,53427,197295,134100,375319,44793,99105,241272
2018-03-31,17926,20307,23530,31393,51042,61224,167497,126362,321686,28461,94899,193437
2017-12-31,18254,21959,34069,32521,60453,88477,147461,131310,290479,27130,103601,171124
2017-09-30,16815,16195,19902,27963,43744,52574,129187,115267,231839,25327,90609,120292
2017-06-30,15634,14504,17267,26007,37955,45260,110920,87781,207000,23611,64567,83451
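Since the Excel file may not be at hand, the round trip can also be sketched on a small in-memory frame. The figures and the names `small`, `metric` and `company` are made up for illustration only.

```python
import pandas as pd

# Hypothetical figures, used only to show how the shape changes.
columns = pd.MultiIndex.from_product(
    [['Sales', 'Profit'], ['Apple', 'Amazon']],
    names=['metric', 'company'],
)
small = pd.DataFrame(
    [[10, 20, 1, 2],
     [30, 40, 3, 4]],
    index=['Q1', 'Q2'],
    columns=columns,
)

stacked = small.stack()       # innermost column level ('company') moves to the rows
restored = stacked.unstack()  # innermost row level moves back to the columns

print(stacked.shape)   # (4, 2): 2 quarters x 2 companies, one column per metric
print(restored.shape)  # (2, 4)
```

The values survive the round trip, although the column order of `restored` may differ from the original depending on the pandas version.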


Let's load another example from the same Excel file. This example adds a third level to the column index:

In [53]:
findata2 = pd.read_excel('../Data/DSS-Unit05-File01.xlsx', sheet_name = 'Financials_v3', header = [0,1,2]) #importing data from an Excel file
findata2

Unnamed: 0_level_0,Income Statement,Income Statement,Income Statement,Income Statement,Income Statement,Income Statement,Balance Sheet,Balance Sheet,Balance Sheet,Balance Sheet,Balance Sheet,Balance Sheet
Unnamed: 0_level_1,Sales/Revenue,Sales/Revenue,Sales/Revenue,Gross Income/Profit,Gross Income/Profit,Gross Income/Profit,Total Assets,Total Assets,Total Assets,Total Liabilities,Total Liabilities,Total Liabilities
Unnamed: 0_level_2,Apple,Amazon,Alphabet,Apple,Amazon,Alphabet,Apple,Amazon,Alphabet,Apple,Amazon,Alphabet
2018-06-30,53427,52886,32758,20789,22254,18875,375319,134100,197295,241272,99105,44793
2018-03-31,61224,51042,31393,23530,20307,17926,321686,126362,167497,193437,94899,28461
2017-12-31,88477,60453,32521,34069,21959,18254,290479,131310,147461,171124,103601,27130
2017-09-30,52574,43744,27963,19902,16195,16815,231839,115267,129187,120292,90609,25327
2017-06-30,45260,37955,26007,17267,14504,15634,207000,87781,110920,83451,64567,23611


In [55]:
findata2.stack()

Unnamed: 0_level_0,Unnamed: 1_level_0,Balance Sheet,Balance Sheet,Income Statement,Income Statement
Unnamed: 0_level_1,Unnamed: 1_level_1,Total Assets,Total Liabilities,Gross Income/Profit,Sales/Revenue
2018-06-30,Alphabet,197295,44793,18875,32758
2018-06-30,Amazon,134100,99105,22254,52886
2018-06-30,Apple,375319,241272,20789,53427
2018-03-31,Alphabet,167497,28461,17926,31393
2018-03-31,Amazon,126362,94899,20307,51042
2018-03-31,Apple,321686,193437,23530,61224
2017-12-31,Alphabet,147461,27130,18254,32521
2017-12-31,Amazon,131310,103601,21959,60453
2017-12-31,Apple,290479,171124,34069,88477
2017-09-30,Alphabet,129187,25327,16815,27963


In [58]:
findata2.stack(level=2) # level 2 is the innermost level, which is the default, so we obtain the same result as above

Unnamed: 0_level_0,Unnamed: 1_level_0,Balance Sheet,Balance Sheet,Income Statement,Income Statement
Unnamed: 0_level_1,Unnamed: 1_level_1,Total Assets,Total Liabilities,Gross Income/Profit,Sales/Revenue
2018-06-30,Alphabet,197295,44793,18875,32758
2018-06-30,Amazon,134100,99105,22254,52886
2018-06-30,Apple,375319,241272,20789,53427
2018-03-31,Alphabet,167497,28461,17926,31393
2018-03-31,Amazon,126362,94899,20307,51042
2018-03-31,Apple,321686,193437,23530,61224
2017-12-31,Alphabet,147461,27130,18254,32521
2017-12-31,Amazon,131310,103601,21959,60453
2017-12-31,Apple,290479,171124,34069,88477
2017-09-30,Alphabet,129187,25327,16815,27963


In [59]:
findata2.stack(level=0) #level 0 is the "financial report" level, this time that level gets "stacked"

Unnamed: 0_level_0,Unnamed: 1_level_0,Gross Income/Profit,Gross Income/Profit,Gross Income/Profit,Sales/Revenue,Sales/Revenue,Sales/Revenue,Total Assets,Total Assets,Total Assets,Total Liabilities,Total Liabilities,Total Liabilities
Unnamed: 0_level_1,Unnamed: 1_level_1,Alphabet,Amazon,Apple,Alphabet,Amazon,Apple,Alphabet,Amazon,Apple,Alphabet,Amazon,Apple
2018-06-30,Balance Sheet,,,,,,,197295.0,134100.0,375319.0,44793.0,99105.0,241272.0
2018-06-30,Income Statement,18875.0,22254.0,20789.0,32758.0,52886.0,53427.0,,,,,,
2018-03-31,Balance Sheet,,,,,,,167497.0,126362.0,321686.0,28461.0,94899.0,193437.0
2018-03-31,Income Statement,17926.0,20307.0,23530.0,31393.0,51042.0,61224.0,,,,,,
2017-12-31,Balance Sheet,,,,,,,147461.0,131310.0,290479.0,27130.0,103601.0,171124.0
2017-12-31,Income Statement,18254.0,21959.0,34069.0,32521.0,60453.0,88477.0,,,,,,
2017-09-30,Balance Sheet,,,,,,,129187.0,115267.0,231839.0,25327.0,90609.0,120292.0
2017-09-30,Income Statement,16815.0,16195.0,19902.0,27963.0,43744.0,52574.0,,,,,,
2017-06-30,Balance Sheet,,,,,,,110920.0,87781.0,207000.0,23611.0,64567.0,83451.0
2017-06-30,Income Statement,15634.0,14504.0,17267.0,26007.0,37955.0,45260.0,,,,,,


In [60]:
findata2.stack().stack() # let's stack twice: the two innermost levels get stacked

Unnamed: 0,Unnamed: 1,Unnamed: 2,Balance Sheet,Income Statement
2018-06-30,Alphabet,Gross Income/Profit,,18875.0
2018-06-30,Alphabet,Sales/Revenue,,32758.0
2018-06-30,Alphabet,Total Assets,197295.0,
2018-06-30,Alphabet,Total Liabilities,44793.0,
2018-06-30,Amazon,Gross Income/Profit,,22254.0
2018-06-30,Amazon,Sales/Revenue,,52886.0
2018-06-30,Amazon,Total Assets,134100.0,
2018-06-30,Amazon,Total Liabilities,99105.0,
2018-06-30,Apple,Gross Income/Profit,,20789.0
2018-06-30,Apple,Sales/Revenue,,53427.0



Below are the methods used to rearrange the order of the levels on an axis or to sort the data by the values in one specific level.

* <code>frame.swaplevel()</code> : takes two level numbers or names and returns a new object with the levels interchanged.
* <code>frame.sort_index()</code> : sorts the data by the index; with the <code>level</code> argument it sorts using the values in a single level. <br>
**Note:** When swapping levels, it is common to also use <code>sort_index</code> so that the result is lexicographically sorted by the indicated level. For instance, <code>frame.swaplevel(0,1).sort_index(level=0)</code>.
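The notebook does not show how `frame` is built. The sketch below is one construction consistent with the output displayed in the following cells; it is an assumption, not necessarily the original definition. The names `swapped` and `sorted_frame` are also introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Assumed construction of `frame`: values 0..11 with two-level row and column indexes.
frame = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
    columns=[['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
)

swapped = frame.swaplevel(0, 1)             # row levels exchanged, row order unchanged
sorted_frame = swapped.sort_index(level=0)  # rows reordered by the new outer level

print(list(swapped.index))       # [(1, 'a'), (2, 'a'), (1, 'b'), (2, 'b')]
print(list(sorted_frame.index))  # [(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')]
```

Note that `swaplevel` only relabels the rows; the data stays in place until `sort_index` reorders it.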


In [93]:
frame

Unnamed: 0_level_0,Unnamed: 1_level_0,Ohio,Ohio,Colorado
Unnamed: 0_level_1,Unnamed: 1_level_1,Green,Red,Green
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [94]:
frame.swaplevel(0,1)

Unnamed: 0_level_0,Unnamed: 1_level_0,Ohio,Ohio,Colorado
Unnamed: 0_level_1,Unnamed: 1_level_1,Green,Red,Green
1,a,0,1,2
2,a,3,4,5
1,b,6,7,8
2,b,9,10,11


In [95]:
frame.swaplevel(0,1).sort_index(level=0)

Unnamed: 0_level_0,Unnamed: 1_level_0,Ohio,Ohio,Colorado
Unnamed: 0_level_1,Unnamed: 1_level_1,Green,Red,Green
1,a,0,1,2
1,b,6,7,8
2,a,3,4,5
2,b,9,10,11


# Aggregate, Group, Filter and Transform Data

Aggregation methods:
* df.count() : counts the number of non-missing items.
* first(), last() : return the first and last item respectively (as aggregations these are GroupBy methods, e.g. <code>df.groupby('key').first()</code>).
* df.mean(), df.median() : return the mean and median respectively.
* df.min(), df.max() : return the minimum and maximum values respectively.
* df.std(), df.var() : return the standard deviation and variance respectively.
* df.mad() : returns the mean absolute deviation (this method was removed in pandas 2.0).
* df.prod() : returns the product of all the items.
* df.sum() : returns the sum of all the items.

Groupby: Split, Apply, Combine
* Split : Breaks and groups data frame depending on the value of the specified key.
* Apply : Computes some functions, usually an aggregate, transformation or filtering, within individual groups.
* Combine : Merges the results of these operations into an output array.
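The three steps can be spelled out by hand and compared with the one-line `groupby` call. This is only an illustrative sketch; `df_demo`, `pieces`, `sums`, `manual` and `builtin` are hypothetical names.

```python
import pandas as pd

df_demo = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                        'data1': range(6)})

# Split: one sub-frame per distinct value of 'key'.
pieces = {key: group for key, group in df_demo.groupby('key')}

# Apply: reduce each piece to a single number.
sums = {key: group['data1'].sum() for key, group in pieces.items()}

# Combine: assemble the per-group results into one Series.
manual = pd.Series(sums, name='data1')

# The equivalent one-liner, where pandas performs all three steps.
builtin = df_demo.groupby('key')['data1'].sum()

print(manual.to_dict())  # {'A': 3, 'B': 5, 'C': 7}
```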

In [96]:
import numpy as np
rng = np.random.RandomState(0)
df = pd.DataFrame({'key' : [ 'A', 'B', 'C', 'A', 'B', 'C'], 'data1': range(6), 'data2': rng.randint(0, 10, 6)})
df


Unnamed: 0,data1,data2,key
0,0,5,A
1,1,0,B
2,2,3,C
3,3,3,A
4,4,7,B
5,5,9,C


In [43]:
#Aggregation
df.groupby('key').aggregate(['min', 'median', 'max'])  # aggregation functions given by name

Unnamed: 0_level_0,data1,data1,data1,data2,data2,data2
Unnamed: 0_level_1,min,median,max,min,median,max
key,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
A,0,1.5,3,3,4.0,5
B,1,2.5,4,0,3.5,7
C,2,3.5,5,3,6.0,9


In [44]:
#Grouping by key using the mean to aggregate
df.groupby('key').mean()

Unnamed: 0_level_0,data1,data2
key,Unnamed: 1_level_1,Unnamed: 2_level_1
A,1.5,4.0
B,2.5,3.5
C,3.5,6.0


In [45]:
#Filtering by key using mean. Only include data if data2 mean for a group (A,B,C) is more than or equal to 4.
def filter_func(x):
    return x['data2'].mean() >= 4

df.groupby('key').filter(filter_func)

Unnamed: 0,data1,data2,key
0,0,5,A
2,2,3,C
3,3,3,A
5,5,9,C


While aggregation must return a reduced version of the data, transformation can return some transformed version of the full data to recombine.

In [46]:
# Transformation

df.groupby('key').transform(lambda x: x - x.mean() )



Unnamed: 0,data1,data2
0,-1.5,1.0
1,-1.5,-3.5
2,-1.5,-3.0
3,1.5,-1.0
4,1.5,3.5
5,1.5,3.0
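The difference between the two can be seen in the shape of the result. The sketch below, with the hypothetical names `df_demo`, `group_means`, `broadcast` and `centered`, computes the group mean both ways and reproduces the centering of `data1` by hand.

```python
import pandas as pd

df_demo = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                        'data1': range(6)})
grouped = df_demo.groupby('key')['data1']

group_means = grouped.mean()           # aggregation: one value per group
broadcast = grouped.transform('mean')  # transformation: one value per original row

# Subtracting the broadcast mean centers each group around zero.
centered = df_demo['data1'] - broadcast

print(len(group_means), len(broadcast))  # 3 6
print(centered.tolist())                 # [-1.5, -1.5, -1.5, 1.5, 1.5, 1.5]
```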


**Apply()** : <br>

This method lets us apply an arbitrary function to each group. The function should take a DataFrame and return either a pandas object (e.g., a DataFrame or a Series) or a scalar.



In [47]:
df

Unnamed: 0,data1,data2,key
0,0,5,A
1,1,0,B
2,2,3,C
3,3,3,A
4,4,7,B
5,5,9,C


In [48]:
def norm_by_data2(x):
    x['data1'] /= x['data2'].sum()
    return x

df.groupby('key').apply(norm_by_data2)

Unnamed: 0,data1,data2,key
0,0.0,5,A
1,0.142857,0,B
2,0.166667,3,C
3,0.375,3,A
4,0.571429,7,B
5,0.416667,9,C


# Merge and Join Data

To combine datasets by linking rows using one or more keys, we use **Merge** or **Join** operations. <br>

In [100]:
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a',  'b'], 'data1': range(7)})
df1

Unnamed: 0,data1,key
0,0,b
1,1,b
2,2,a
3,3,c
4,4,a
5,5,a
6,6,b


In [101]:
df2 = pd.DataFrame ({'key': [ 'a', 'b', 'd'], 'data2': range(3)})
df2

Unnamed: 0,data2,key
0,0,a
1,1,b
2,2,d


In [51]:

pd.merge(df1, df2) 

Unnamed: 0,data1,key,data2
0,0,b,1
1,1,b,1
2,6,b,1
3,2,a,0
4,4,a,0
5,5,a,0


This **merge** uses the overlapping column names as the keys, because we did not specify explicitly which column to **join** on. To make the key explicit, we pass it with the <code>on</code> argument:



In [52]:
pd.merge(df1, df2, on = 'key')


Unnamed: 0,data1,key,data2
0,0,b,1
1,1,b,1
2,6,b,1
3,2,a,0
4,4,a,0
5,5,a,0


By default <code>merge</code> performs an inner join; this can be changed with the <code>how</code> argument.

Available options : <br>
* Inner : Use only the key combinations observed in both tables.
* Outer : Use all the key combinations observed in both tables together.
* Left : Use all key combinations found in the left table.
* Right : Use all key combinations found in the right table.


In [53]:
pd.merge(df1, df2, how = 'outer')

Unnamed: 0,data1,key,data2
0,0.0,b,1.0
1,1.0,b,1.0
2,6.0,b,1.0
3,2.0,a,0.0
4,4.0,a,0.0
5,5.0,a,0.0
6,3.0,c,
7,,d,2.0


In [99]:
pd.merge(df1, df2, how = 'right')

Unnamed: 0,data1,key,data2
0,0.0,b,1
1,1.0,b,1
2,6.0,b,1
3,2.0,a,0
4,4.0,a,0
5,5.0,a,0
6,,d,2
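To compare the four join types side by side, the sketch below merges the same two frames with each `how` value and records the number of rows. It also uses the `indicator=True` option of `pd.merge`, which adds a `_merge` column showing where each row came from. The names `row_counts` and `flagged` are hypothetical.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})

# Number of rows produced by each join type.
row_counts = {}
for how in ['inner', 'outer', 'left', 'right']:
    merged = pd.merge(df1, df2, on='key', how=how)
    row_counts[how] = len(merged)

print(row_counts)  # {'inner': 6, 'outer': 8, 'left': 7, 'right': 7}

# indicator=True labels each row as 'both', 'left_only' or 'right_only'.
flagged = pd.merge(df1, df2, on='key', how='outer', indicator=True)
print(flagged[['key', '_merge']])
```

Key `c` appears only in `df1` and key `d` only in `df2`, which is why the outer join has two more rows than the inner join.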


Concatenation is another operation used to combine data. For arrays, it can be performed with the NumPy <code>concatenate</code> function as below:

In [54]:
arr = np.arange(12).reshape((3, 4))
print(arr)
np.concatenate([arr, arr], axis =1)

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]


array([[ 0,  1,  2,  3,  0,  1,  2,  3],
       [ 4,  5,  6,  7,  4,  5,  6,  7],
       [ 8,  9, 10, 11,  8,  9, 10, 11]])

The <code>concat</code> function in pandas provides a consistent way to do this for labeled objects, especially when the indexes of the objects do not fully overlap, so that some positions along the other axis have no data.



In [55]:
s1 = pd.Series([0,1], index = ['a', 'b'])
s2 = pd.Series([ 2, 3, 4], index = ['c', 'd', 'e'])
s3 = pd.Series([5, 6], index = ['f', 'g'] )
print('S1 : ')
print(s1)
print('S2 : ')
print(s2)
print('S3 : ')
print(s3)


S1 : 
a    0
b    1
dtype: int64
S2 : 
c    2
d    3
e    4
dtype: int64
S3 : 
f    5
g    6
dtype: int64


In [56]:
pd.concat([s1, s2, s3], axis = 1, sort=True)

Unnamed: 0,0,1,2
a,0.0,,
b,1.0,,
c,,2.0,
d,,3.0,
e,,4.0,
f,,,5.0
g,,,6.0


Now let us control how the indexes are combined by using the **join** argument:

In [57]:
s4 = pd.concat([s1, s3])
pd.concat([s1, s4], axis = 1, join = 'inner')


Unnamed: 0,0,1
a,0,0
b,1,1


# Pivot Tables

A pivot table is a similar operation that is commonly seen in spreadsheets and other programs that operate on tabular data. The pivot table takes simple column-wise data as input and groups the entries into a two dimensional table that provides a multidimensional summarization of the data. <br>

In [58]:
import numpy as np
import pandas as pd
import seaborn as sns # statistical data visualization
titanic = sns.load_dataset('titanic') # load_dataset looks for online csv files 
                                    #on https://github.com/mwaskom/seaborn-data
titanic

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
5,0,3,male,,0,0,8.4583,Q,Third,man,True,,Queenstown,no,True
6,0,1,male,54.0,0,0,51.8625,S,First,man,True,E,Southampton,no,True
7,0,3,male,2.0,3,1,21.0750,S,Third,child,False,,Southampton,no,False
8,1,3,female,27.0,0,2,11.1333,S,Third,woman,False,,Southampton,yes,False
9,1,2,female,14.0,1,0,30.0708,C,Second,child,False,,Cherbourg,yes,False


In [59]:
titanic.pivot_table(values=['survived','fare'], index = 'sex', columns = 'class')

Unnamed: 0_level_0,fare,fare,fare,survived,survived,survived
class,First,Second,Third,First,Second,Third
sex,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
female,106.125798,21.970121,16.11881,0.968085,0.921053,0.5
male,67.226127,19.741782,12.661633,0.368852,0.157407,0.135447


The <code>aggfunc</code> keyword controls what type of aggregation is applied, which is the mean by default. As with GroupBy, the aggregation can be specified as a string naming one of several common choices ('sum', 'mean', 'count', 'min', 'max', etc.) or as a function that implements an aggregation (np.sum, min, sum, etc.), passed without parentheses.

Additionally, it can be specified as a dictionary mapping a column to any of the above options:

In [60]:
titanic.pivot_table(index = 'sex', columns = 'class', aggfunc = {'survived' : sum, 'fare' : 'mean'})

Unnamed: 0_level_0,fare,fare,fare,survived,survived,survived
class,First,Second,Third,First,Second,Third
sex,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
female,106.125798,21.970121,16.11881,91,70,72
male,67.226127,19.741782,12.661633,45,17,47
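A pivot table with the default mean aggregation can be thought of as a `groupby` on two keys followed by `unstack`. The sketch below checks this on a tiny made-up table (the `people` frame and its values are hypothetical, not the Titanic data).

```python
import pandas as pd

# Hypothetical toy data with the same column names as the Titanic example.
people = pd.DataFrame({
    'sex': ['female', 'female', 'male', 'male', 'male', 'female'],
    'class': ['First', 'Third', 'First', 'Third', 'Third', 'First'],
    'survived': [1, 0, 0, 0, 1, 1],
})

# Pivot table: mean of 'survived' for each (sex, class) combination.
pivoted = people.pivot_table(values='survived', index='sex', columns='class')

# The same summary written as groupby + unstack.
grouped = people.groupby(['sex', 'class'])['survived'].mean().unstack()

print(pivoted)
print(grouped)
```

Both results are 2 x 2 tables with `sex` on the rows and `class` on the columns.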


# Data cleaning and preparation

Data cleaning is the process of removing bad data in a dataset. This bad data includes incorrect and improperly formatted data as well as duplicated and missing data. 

In [61]:
# Example of a student survey dataset which includes incorrect and improperly formatted data.

csv_path = 'https://s3.amazonaws.com/dss-fall2018/Student_Survey.csv'

df = pd.read_csv (csv_path)
df

Unnamed: 0,Year,Location,Education,Sample_Size,Satisfactory
0,2017.0,Putnam,Middle School,659.0,Y
1,2018.0,Lexington,Middle School,649.0,N
2,2018.0,Lexington,Middle School,435.0,N
3,2017.0,Berkeley,,,
4,,Berkeley,High School,228.0,Y
5,2018.0,Berkeley,Middle School,20.0,
6,2018.0,Washington,High School,437.0,N
7,,Tremont,High School,,Y
8,2016.0,Tremont,High School,220.0,Y


Let's take a look at the dataset. There are seven NA values in total across the columns. The ```isnull``` function returns ```True``` for every missing value and ```False``` otherwise.

In [62]:
# Recognizing missing values

print (df.isnull())


    Year  Location  Education  Sample_Size  Satisfactory
0  False     False      False        False         False
1  False     False      False        False         False
2  False     False      False        False         False
3  False     False       True         True          True
4   True     False      False        False         False
5  False     False      False        False          True
6  False     False      False        False         False
7   True     False      False         True         False
8  False     False      False        False         False
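Because `True` counts as 1 when summed, the boolean table can be reduced to a count of missing values per column. A minimal sketch on a small made-up frame (`survey` and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical three-row survey with two missing values.
survey = pd.DataFrame({
    'Year': [2017, np.nan, 2018],
    'Location': ['Putnam', 'Berkeley', 'Tremont'],
    'Sample_Size': [659, 228, np.nan],
})

missing_per_column = survey.isnull().sum()      # number of NaN values in each column
total_missing = int(missing_per_column.sum())   # number of NaN values in the whole frame

print(missing_per_column.to_dict())  # {'Year': 1, 'Location': 0, 'Sample_Size': 1}
print(total_missing)                 # 2
```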


### ```dropna``` method

```dropna``` is a method to filter out missing data. It is useful when you want to work only with complete records and omit the rest.

In [63]:
# Dropping every row that contains at least one missing value

df.dropna()

Unnamed: 0,Year,Location,Education,Sample_Size,Satisfactory
0,2017.0,Putnam,Middle School,659.0,Y
1,2018.0,Lexington,Middle School,649.0,N
2,2018.0,Lexington,Middle School,435.0,N
6,2018.0,Washington,High School,437.0,N
8,2016.0,Tremont,High School,220.0,Y


In [65]:
# Dropping rows and columns that are entirely NA (in this example, no row or column is all NA, so nothing is dropped)

df.dropna(how = 'all')    #For rows
df.dropna(axis = 1, how ='all')     #For columns

Unnamed: 0,Year,Location,Education,Sample_Size,Satisfactory
0,2017.0,Putnam,Middle School,659.0,Y
1,2018.0,Lexington,Middle School,649.0,N
2,2018.0,Lexington,Middle School,435.0,N
3,2017.0,Berkeley,,,
4,,Berkeley,High School,228.0,Y
5,2018.0,Berkeley,Middle School,20.0,
6,2018.0,Washington,High School,437.0,N
7,,Tremont,High School,,Y
8,2016.0,Tremont,High School,220.0,Y
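Since the survey data has no row or column that is entirely missing, the effect of `how='all'` is easier to see on a toy frame. The frame `toy` below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 2 is all NaN and column 'z' is all NaN.
toy = pd.DataFrame({
    'x': [1.0, np.nan, np.nan],
    'y': [2.0, 5.0, np.nan],
    'z': [np.nan, np.nan, np.nan],
})

any_rows = toy.dropna()                   # drops rows with at least one NaN
all_rows = toy.dropna(how='all')          # drops only rows that are entirely NaN
all_cols = toy.dropna(axis=1, how='all')  # drops only columns that are entirely NaN

print(len(any_rows))           # 0, because every row has a NaN in 'z'
print(list(all_rows.index))    # [0, 1]
print(list(all_cols.columns))  # ['x', 'y']
```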


### ```fillna``` method 

```fillna``` is a method to fill missing data with a given number or value.

In [66]:
# Filling missing data in the 'Sample_Size' column with 100 (returns a new Series; df is unchanged)

print (df['Sample_Size'])
df['Sample_Size'].fillna(100)

0    659.0
1    649.0
2    435.0
3      NaN
4    228.0
5     20.0
6    437.0
7      NaN
8    220.0
Name: Sample_Size, dtype: float64


0    659.0
1    649.0
2    435.0
3    100.0
4    228.0
5     20.0
6    437.0
7    100.0
8    220.0
Name: Sample_Size, dtype: float64

In [67]:
# filling missing data in column 'Year' by 2015.

df.fillna({'Year': 2015})

Unnamed: 0,Year,Location,Education,Sample_Size,Satisfactory
0,2017.0,Putnam,Middle School,659.0,Y
1,2018.0,Lexington,Middle School,649.0,N
2,2018.0,Lexington,Middle School,435.0,N
3,2017.0,Berkeley,,,
4,2015.0,Berkeley,High School,228.0,Y
5,2018.0,Berkeley,Middle School,20.0,
6,2018.0,Washington,High School,437.0,N
7,2015.0,Tremont,High School,,Y
8,2016.0,Tremont,High School,220.0,Y


### Removing Duplicates

Sometimes a DataFrame contains duplicate rows that need to be removed.

To get a sample DataFrame with duplicate rows, let's create a small DataFrame in which the last row repeats an earlier one:

In [72]:
# Creating a DataFrame with two identical rows
raw_data = [['Bruce','Banner',38,4,25],['Tony','Stark',42,24,94],['Hal','Jordan',25,31,57],['Bruce','Wayne',32,2,62],
            ['Clark','Kent',28,3,70],['Tony','Stark',42,24,94]]
df = pd.DataFrame (raw_data, columns = ['first_name', 'last_name','age','preTestScore','postTestScore'])
df

Unnamed: 0,first_name,last_name,age,preTestScore,postTestScore
0,Bruce,Banner,38,4,25
1,Tony,Stark,42,24,94
2,Hal,Jordan,25,31,57
3,Bruce,Wayne,32,2,62
4,Clark,Kent,28,3,70
5,Tony,Stark,42,24,94


The ```duplicated()``` method returns a boolean Series indicating whether each row is a duplicate of an earlier row. In this DataFrame, the value at index 5 is ```True```, which means that row 5 repeats an earlier row (row 1).

In [73]:
df.duplicated()

0    False
1    False
2    False
3    False
4    False
5     True
dtype: bool

The ```drop_duplicates()``` method returns a DataFrame with the duplicate rows removed.

In [74]:
df.drop_duplicates()

Unnamed: 0,first_name,last_name,age,preTestScore,postTestScore
0,Bruce,Banner,38,4,25
1,Tony,Stark,42,24,94
2,Hal,Jordan,25,31,57
3,Bruce,Wayne,32,2,62
4,Clark,Kent,28,3,70
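By default, rows count as duplicates only when every column matches, and the first occurrence is kept. Both behaviours can be changed with the `subset` and `keep` arguments of `drop_duplicates`, as in this sketch on a small made-up frame (`people` and the result names are hypothetical).

```python
import pandas as pd

people = pd.DataFrame({
    'first_name': ['Bruce', 'Tony', 'Bruce', 'Tony'],
    'last_name': ['Banner', 'Stark', 'Wayne', 'Stark'],
    'age': [38, 42, 32, 42],
})

full = people.drop_duplicates()                                   # compare all columns
by_first = people.drop_duplicates(subset='first_name')            # compare one column, keep first
by_first_last = people.drop_duplicates(subset='first_name', keep='last')  # keep last instead

print(list(full.index))           # [0, 1, 2]
print(list(by_first.index))       # [0, 1]
print(list(by_first_last.index))  # [2, 3]
```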


# References
1. Seaborn, statistical data visualization, https://seaborn.pydata.org
2. Data repo for Seaborn examples, https://github.com/mwaskom/seaborn-data
3. Pivot Tables, https://pandas.pydata.org/pandas-docs/stable/generated/pandas.pivot_table.html
4. Reshaping, Pivot, Stack, unstack, https://pandas.pydata.org/pandas-docs/stable/reshaping.html