# Python Pandas

* 1 - Handling missing data
  * 1.1 - Filtering out missing data
  * 1.2 - Filling in missing data
    * 1.2.1 - `ffill` method for interpolation
    * 1.2.2 - `bfill` method for interpolation
    * 1.2.3 - Alternative fillings
* 2 - Data Transformation
  * 2.1 - Remove Duplicates
  * 2.2 - Data Mapping
  
[Official Documentation](https://pandas.pydata.org/pandas-docs/stable/index.html)

In [1]:
import pandas as pd
import numpy as np

In [2]:
Portfolio = pd.DataFrame({'Fund':['A', 'B',    'C', 'D',    'E' , np.nan],
                          'NPV': [1345, np.nan, 864, 1548,   952, np.nan],
                          'MV' : [1355, 764,    871, np.nan, 941, np.nan]
                         })
Portfolio

Unnamed: 0,Fund,NPV,MV
0,A,1345.0,1355.0
1,B,,764.0
2,C,864.0,871.0
3,D,1548.0,
4,E,952.0,941.0
5,,,


In [3]:
Portfolio.isnull()

Unnamed: 0,Fund,NPV,MV
0,False,False,False
1,False,True,False
2,False,False,False
3,False,False,True
4,False,False,False
5,True,True,True


In [4]:
Portfolio[['NPV']].isnull()

Unnamed: 0,NPV
0,False
1,True
2,False
3,False
4,False
5,True


**NA handling methods** - pandas provides four main methods for working with missing data.

|Method|Description|
|--------|-----------|
|dropna  |Filter out missing data|
|fillna  |Fill in missing values with a given value, or by propagating neighbouring values (`ffill` / `bfill`)|
|isnull  |Boolean mask that is `True` where a value is missing|
|notnull |Negation of isnull|

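As a quick check before choosing one of these methods, the boolean mask can be summed to count the missing values in each column. The sketch below rebuilds the same *Portfolio* frame so that it runs on its own; the names `missing_per_column` and `rows_with_any_na` are illustrative only.

```python
import numpy as np
import pandas as pd

Portfolio = pd.DataFrame({'Fund': ['A', 'B', 'C', 'D', 'E', np.nan],
                          'NPV': [1345, np.nan, 864, 1548, 952, np.nan],
                          'MV': [1355, 764, 871, np.nan, 941, np.nan]})

# True counts as 1, so summing the mask gives the number of NAs per column
missing_per_column = Portfolio.isnull().sum()

# notnull() is the element-wise negation of isnull()
masks_are_opposite = Portfolio.notnull().equals(~Portfolio.isnull())

# Rows that contain at least one NA
rows_with_any_na = Portfolio[Portfolio.isnull().any(axis=1)]

print(missing_per_column)
print(masks_are_opposite)
print(rows_with_any_na)
```
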

## 1.1 - Filtering out missing data

Use the `dropna()` method on the *Portfolio* dataframe. By default, it removes any row that contains at least one missing value.

In [5]:
Portfolio

Unnamed: 0,Fund,NPV,MV
0,A,1345.0,1355.0
1,B,,764.0
2,C,864.0,871.0
3,D,1548.0,
4,E,952.0,941.0
5,,,


In [6]:
Portfolio.dropna()

Unnamed: 0,Fund,NPV,MV
0,A,1345.0,1355.0
2,C,864.0,871.0
4,E,952.0,941.0

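Dropping every row with any NA can be stricter than needed. As a minimal sketch, `dropna` also accepts `subset` (only the listed columns are checked) and `thresh` (the minimum number of non-NA values a row needs to be kept). The variable names `has_npv` and `at_least_two` are illustrative only.

```python
import numpy as np
import pandas as pd

Portfolio = pd.DataFrame({'Fund': ['A', 'B', 'C', 'D', 'E', np.nan],
                          'NPV': [1345, np.nan, 864, 1548, 952, np.nan],
                          'MV': [1355, 764, 871, np.nan, 941, np.nan]})

# Only the NPV column decides whether a row is dropped
has_npv = Portfolio.dropna(subset=['NPV'])

# Keep rows that hold at least two non-NA values
at_least_two = Portfolio.dropna(thresh=2)

print(has_npv)
print(at_least_two)
```

With this data, `subset=['NPV']` drops rows 1 and 5, while `thresh=2` drops only row 5.
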

Pass `how = 'all'` to drop only the rows in which every value is NA:

In [7]:
Portfolio.dropna(how = 'all')

Unnamed: 0,Fund,NPV,MV
0,A,1345.0,1355.0
1,B,,764.0
2,C,864.0,871.0
3,D,1548.0,
4,E,952.0,941.0


Columns can be dropped in the same way by passing `axis = 1`:

In [8]:
Portfolio2 = pd.DataFrame({'Fund':['A', 'B',    'C', 'D',    'E' ],
                           'NPV': [1345, np.nan, 864, 1548,   952]
                          })
Portfolio2

Unnamed: 0,Fund,NPV
0,A,1345.0
1,B,
2,C,864.0
3,D,1548.0
4,E,952.0


In [9]:
Portfolio2.dropna(axis = 1)

Unnamed: 0,Fund
0,A
1,B
2,C
3,D
4,E


In [10]:
Portfolio.T

Unnamed: 0,0,1,2,3,4,5
Fund,A,B,C,D,E,
NPV,1345,,864,1548,952,
MV,1355,764,871,,941,


From the example above, remove the columns in which every value is NA, i.e. the last column.

In [11]:
Portfolio.T.dropna(axis = 1, how = 'all')

Unnamed: 0,0,1,2,3,4
Fund,A,B,C,D,E
NPV,1345,,864,1548,952
MV,1355,764,871,,941


## 1.2 - Filling in missing data

There are several ways to fill in the "holes" in a dataset. The `fillna` method is the most commonly used.

Replace NAs with 0. Note that this also puts a 0 in the missing *Fund* label:

In [12]:
Portfolio.fillna(0)

Unnamed: 0,Fund,NPV,MV
0,A,1345.0,1355.0
1,B,0.0,764.0
2,C,864.0,871.0
3,D,1548.0,0.0
4,E,952.0,941.0
5,0,0.0,0.0


Replace NAs with a different value for each column. Columns that are not listed, such as *Fund*, are left unchanged:

In [13]:
Portfolio.fillna({'NPV': 500,
                  'MV': 600 })

Unnamed: 0,Fund,NPV,MV
0,A,1345.0,1355.0
1,B,500.0,764.0
2,C,864.0,871.0
3,D,1548.0,600.0
4,E,952.0,941.0
5,,500.0,600.0

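Two details of `fillna` are easy to miss: it returns a new dataframe instead of modifying the original, and a dictionary can mix fill values of different types. The sketch below illustrates both. The `'Unknown'` label is a hypothetical placeholder, not a value from the original data.

```python
import numpy as np
import pandas as pd

Portfolio = pd.DataFrame({'Fund': ['A', 'B', 'C', 'D', 'E', np.nan],
                          'NPV': [1345, np.nan, 864, 1548, 952, np.nan],
                          'MV': [1355, 764, 871, np.nan, 941, np.nan]})

# 'Unknown' is a hypothetical label; MV is not listed, so it keeps its NAs
filled = Portfolio.fillna({'Fund': 'Unknown', 'NPV': 0})

# The original frame is untouched because fillna returns a new object
original_npv_nas = Portfolio['NPV'].isnull().sum()

print(filled)
print(original_npv_nas)
```
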

### 1.2.1 - `ffill` method for interpolation

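`ffill` stands for forward fill: each missing value is replaced with the last valid value above it in the same column. Recent pandas versions deprecate the `method` argument of `fillna`, so the dedicated `ffill()` method is used here.

In [14]:
Portfolio.ffill()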

Unnamed: 0,Fund,NPV,MV
0,A,1345.0,1355.0
1,B,1345.0,764.0
2,C,864.0,871.0
3,D,1548.0,871.0
4,E,952.0,941.0
5,E,952.0,941.0


### 1.2.2 - `bfill` method for interpolation

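`bfill` stands for backward fill: each missing value is replaced with the next valid value below it in the same column. The last row stays NA because there is no row below it to copy from. Recent pandas versions deprecate the `method` argument of `fillna`, so the dedicated `bfill()` method is used here.

In [15]:
Portfolio.bfill()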

Unnamed: 0,Fund,NPV,MV
0,A,1345.0,1355.0
1,B,864.0,764.0
2,C,864.0,871.0
3,D,1548.0,941.0
4,E,952.0,941.0
5,,,

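The difference between the two directions is easiest to see on a small series with a gap of two consecutive NAs. The sketch below uses a made-up series (not taken from the *Portfolio* data) and also shows the `limit` parameter, which caps how many consecutive NAs are filled.

```python
import numpy as np
import pandas as pd

# Illustrative series with a two-value gap in the middle
s = pd.Series([1.0, np.nan, np.nan, 4.0])

forward = s.ffill()          # gap takes the value before it
backward = s.bfill()         # gap takes the value after it
limited = s.ffill(limit=1)   # fill at most one NA per gap

print(forward.tolist())
print(backward.tolist())
print(limited.tolist())
```

With `limit=1`, only the first NA in the gap is filled and the second one remains missing.
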

### 1.2.3 - Alternative fillings

The mean or median of each column can also be used to fill in the NAs. Here that affects Fund B for NPV, Fund D for MV, and the last row for both NPV and MV.

In [16]:
# numeric_only = True skips the non-numeric Fund column
Portfolio.fillna(Portfolio.mean(numeric_only = True))

Unnamed: 0,Fund,NPV,MV
0,A,1345.0,1355.0
1,B,1177.25,764.0
2,C,864.0,871.0
3,D,1548.0,982.75
4,E,952.0,941.0
5,,1177.25,982.75

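The median variant mentioned above works the same way. A minimal sketch follows, rebuilding the *Portfolio* frame so that it runs on its own; `medians` and `filled` are illustrative names.

```python
import numpy as np
import pandas as pd

Portfolio = pd.DataFrame({'Fund': ['A', 'B', 'C', 'D', 'E', np.nan],
                          'NPV': [1345, np.nan, 864, 1548, 952, np.nan],
                          'MV': [1355, 764, 871, np.nan, 941, np.nan]})

# Per-column medians of the numeric columns (NAs are skipped by default)
medians = Portfolio.median(numeric_only=True)

# fillna aligns the Series on column names, so Fund is left as it is
filled = Portfolio.fillna(medians)

print(medians)
print(filled)
```

The median is less sensitive to outliers than the mean, which can matter when a single fund is much larger than the rest.
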

# 2 - Data Transformation

## 2.1 - Remove Duplicates

In the dataframe below, Fund E appears twice.

In [17]:
Portfolio3 = pd.DataFrame({'Fund':['A', 'B', 'C', 'D',  'E' , 'E', 'F'],
                          'NPV':  [1345, 954, 864, 1548, 952,  952, 1177]
                         })
Portfolio3

Unnamed: 0,Fund,NPV
0,A,1345
1,B,954
2,C,864
3,D,1548
4,E,952
5,E,952
6,F,1177


In [18]:
Portfolio3.duplicated()

0    False
1    False
2    False
3    False
4    False
5     True
6    False
dtype: bool

In [19]:
Portfolio3.drop_duplicates()

Unnamed: 0,Fund,NPV
0,A,1345
1,B,954
2,C,864
3,D,1548
4,E,952
6,F,1177

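By default, `duplicated()` and `drop_duplicates()` compare whole rows. When rows share a key but differ elsewhere, the `subset` and `keep` parameters control what counts as a duplicate and which copy survives. The frame below is a hypothetical variation in which Fund E has two different NPV values.

```python
import pandas as pd

# Hypothetical data: Fund E appears twice with different NPVs
Portfolio4 = pd.DataFrame({'Fund': ['A', 'E', 'E'],
                           'NPV': [1345, 952, 960]})

# No row is a full duplicate, so nothing is flagged
full_row_flags = Portfolio4.duplicated()

# Compare on Fund only; keep the first occurrence (default)
keep_first = Portfolio4.drop_duplicates(subset=['Fund'])

# Keep the last occurrence instead
keep_last = Portfolio4.drop_duplicates(subset=['Fund'], keep='last')

# Drop every row whose Fund appears more than once
keep_none = Portfolio4.drop_duplicates(subset=['Fund'], keep=False)

print(keep_first)
print(keep_last)
print(keep_none)
```
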

## 2.2 - Data Mapping

The example below has three funds, and the goal is to match each `Fund Code` to its `Fund Name`.

(The same result could also be obtained with `pd.merge` and `how = 'left'`.)

In [20]:
Funds = pd.DataFrame({'Fund Code' :    ['Fund A',   'Fund B',  'Fund A',     'Fund C',  'Fund C'],
                      'Position':      ['Food Co.', 'Tech Co', 'Housing Co', 'Tech Co', 'Cloth Co'],
                      'Quantity':      [100,         20,        125,          50,        75],
                      'Market Value' : [1500,        500,       2250,         750,       800]
                     })

Funds

Unnamed: 0,Fund Code,Position,Quantity,Market Value
0,Fund A,Food Co.,100,1500
1,Fund B,Tech Co,20,500
2,Fund A,Housing Co,125,2250
3,Fund C,Tech Co,50,750
4,Fund C,Cloth Co,75,800


In [21]:
Code_to_name = {
    'Fund A': 'Value Fund',
    'Fund B': 'Growth Fund',
    'Fund C': 'Mixed Fund'
}

In [22]:
data = Funds['Fund Code']
data

0    Fund A
1    Fund B
2    Fund A
3    Fund C
4    Fund C
Name: Fund Code, dtype: object

In [23]:
Funds['Fund Name'] = data.map(Code_to_name)
Funds

Unnamed: 0,Fund Code,Position,Quantity,Market Value,Fund Name
0,Fund A,Food Co.,100,1500,Value Fund
1,Fund B,Tech Co,20,500,Growth Fund
2,Fund A,Housing Co,125,2250,Value Fund
3,Fund C,Tech Co,50,750,Mixed Fund
4,Fund C,Cloth Co,75,800,Mixed Fund


In [24]:
Funds[['Fund Code', 'Fund Name', 'Position', 'Quantity', 'Market Value']]

Unnamed: 0,Fund Code,Fund Name,Position,Quantity,Market Value
0,Fund A,Value Fund,Food Co.,100,1500
1,Fund B,Growth Fund,Tech Co,20,500
2,Fund A,Value Fund,Housing Co,125,2250
3,Fund C,Mixed Fund,Tech Co,50,750
4,Fund C,Mixed Fund,Cloth Co,75,800

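One behaviour of `map` worth knowing: a code that is missing from the dictionary is mapped to NaN and no error is raised. The sketch below shows this with a hypothetical `'Fund D'` code that is not in `Code_to_name`, along with one possible fallback that keeps the original code.

```python
import pandas as pd

Code_to_name = {
    'Fund A': 'Value Fund',
    'Fund B': 'Growth Fund',
    'Fund C': 'Mixed Fund'
}

# 'Fund D' is a hypothetical code with no entry in the dictionary
codes = pd.Series(['Fund A', 'Fund B', 'Fund D'], name='Fund Code')

# Unmapped codes become NaN
names = codes.map(Code_to_name)

# One possible fallback: reuse the code where no name was found
names_with_fallback = names.fillna(codes)

print(names)
print(names_with_fallback)
```
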

An alternative approach uses the `merge` function:

In [25]:
Funds = pd.DataFrame({'Fund Code' :    ['Fund A',   'Fund B',  'Fund A',     'Fund C',  'Fund C'],
                      'Position':      ['Food Co.', 'Tech Co', 'Housing Co', 'Tech Co', 'Cloth Co'],
                      'Quantity':      [100,         20,        125,          50,        75],
                      'Market Value' : [1500,        500,       2250,         750,       800]
                     })

Designations = pd.DataFrame({'Fund Code' : ['Fund A',     'Fund B',      'Fund C'],
                             'Fund Name' : ['Value Fund', 'Growth Fund', 'Mixed Fund']
                            })
Funds

Unnamed: 0,Fund Code,Position,Quantity,Market Value
0,Fund A,Food Co.,100,1500
1,Fund B,Tech Co,20,500
2,Fund A,Housing Co,125,2250
3,Fund C,Tech Co,50,750
4,Fund C,Cloth Co,75,800


In [26]:
Designations

Unnamed: 0,Fund Code,Fund Name
0,Fund A,Value Fund
1,Fund B,Growth Fund
2,Fund C,Mixed Fund


In [27]:
pd.merge(Funds, Designations, on = 'Fund Code', how = 'left')[['Fund Code', 'Fund Name', 'Position', 'Quantity', 'Market Value']]

Unnamed: 0,Fund Code,Fund Name,Position,Quantity,Market Value
0,Fund A,Value Fund,Food Co.,100,1500
1,Fund B,Growth Fund,Tech Co,20,500
2,Fund A,Value Fund,Housing Co,125,2250
3,Fund C,Mixed Fund,Tech Co,50,750
4,Fund C,Mixed Fund,Cloth Co,75,800

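When a merge is used as a lookup, two optional parameters of `pd.merge` help catch problems early. This is a sketch, not part of the original notebook: `validate = 'many_to_one'` raises an error if a `Fund Code` appears more than once in the lookup table, and `indicator = True` adds a `_merge` column showing whether each row found a match.

```python
import pandas as pd

Funds = pd.DataFrame({'Fund Code': ['Fund A', 'Fund B', 'Fund A', 'Fund C', 'Fund C'],
                      'Position': ['Food Co.', 'Tech Co', 'Housing Co', 'Tech Co', 'Cloth Co'],
                      'Quantity': [100, 20, 125, 50, 75],
                      'Market Value': [1500, 500, 2250, 750, 800]})

Designations = pd.DataFrame({'Fund Code': ['Fund A', 'Fund B', 'Fund C'],
                             'Fund Name': ['Value Fund', 'Growth Fund', 'Mixed Fund']})

merged = pd.merge(Funds, Designations,
                  on='Fund Code', how='left',
                  validate='many_to_one',  # each code must be unique in Designations
                  indicator=True)          # adds the _merge column

# 'both' means the row found a matching Fund Code in Designations
all_matched = (merged['_merge'] == 'both').all()

print(merged)
print(all_matched)
```
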