### Operating on Data in Pandas
* **Unary operations**, e.g. negation and trigonometric functions: these ufuncs preserve index and column labels in the output.
* **Binary operations**, e.g. addition and multiplication: Pandas automatically *aligns indices* when passing the objects to the ufunc.
    * <font color = red> Index alignment keeps the context of the data intact when combining data from different sources</font>

In [1]:
import numpy as np
import pandas as pd

#### Ufuncs: Index Preservation

In [2]:
rng = np.random.RandomState(42)
ser = pd.Series(rng.randint(0, 10, 4))
ser

0    6
1    3
2    7
3    4
dtype: int32

In [4]:
df = pd.DataFrame(rng.randint(0, 10, (3, 4)),
                 columns = ['A', 'B', 'C', 'D'])
df

Unnamed: 0,A,B,C,D
0,1,7,5,1
1,4,0,9,5
2,8,0,9,2


In [5]:
# indices preserved
np.exp(ser)

0     403.428793
1      20.085537
2    1096.633158
3      54.598150
dtype: float64

In [6]:
# indices preserved
np.sin(df * np.pi / 4)

Unnamed: 0,A,B,C,D
0,0.7071068,-0.707107,-0.707107,0.707107
1,1.224647e-16,0.0,0.707107,-0.707107
2,-2.449294e-16,0.0,0.707107,1.0


### Ufuncs: Index Alignment
* For binary operations, Pandas will align indices in the process of performing the operation

* Index alignment in Series
    * The resulting array contains the union of the indices of the two input arrays
    * Any item for which one or the other does not have an entry is marked with NaN.
        * <font color = red> Pandas marks missing data as NaN </font>
    * If NaN is not desired, use the corresponding object method in place of the operator and pass a fill value.
        * A.add(other, level=None, fill_value=None, axis=0)
            * fill_value: Fill missing (NaN) values with this value before the computation. If a value is missing in both Series, the result will be missing.
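
As a small sketch of the `fill_value` rule above, the example below uses two made-up Series (`s1` and `s2` are illustrative names, not part of the notes). Label `'c'` is NaN in `s1` and absent from `s2`, so it is missing on both sides and stays missing even with a fill value.

```python
import numpy as np
import pandas as pd

# Two Series with partially overlapping indices.
# 'c' is NaN in s1 and does not exist in s2, so it is missing on both sides.
s1 = pd.Series([1.0, 2.0, np.nan], index=['a', 'b', 'c'])
s2 = pd.Series([10.0, 20.0], index=['b', 'd'])

plain = s1 + s2                    # NaN wherever either side is missing
filled = s1.add(s2, fill_value=0)  # a value missing on one side is treated as 0

print(plain)
print(filled)
```

Only `'b'` survives in `plain`, while `filled` has values for `'a'`, `'b'`, and `'d'` and keeps NaN at `'c'`.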

In [8]:
area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
                  'California': 423967}, name='area')  # name
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127}, name='population')

In [10]:
area.name, population.name

('area', 'population')

In [11]:
# The resulting array contains the union of the indices of the two input arrays
population/area

Alaska              NaN
California    90.413926
New York            NaN
Texas         38.018740
dtype: float64

In [12]:
A = pd.Series([2, 4, 6], index = [0, 1, 2])
B = pd.Series([1, 3, 5], index = [1, 2, 3])
A + B

0    NaN
1    5.0
2    9.0
3    NaN
dtype: float64

In [14]:
A.add(B, fill_value = 0)

0    2.0
1    5.0
2    9.0
3    5.0
dtype: float64

* Index alignment in DataFrame
    * Indices are aligned correctly irrespective of their order in the two objects
    * Indices in the result are sorted.
    
<img src = "files/pandas_methods.PNG" width = 500>
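
The object methods used in these notes (such as `add` with `fill_value`) are counterparts of the ordinary Python operators. The sketch below checks that equivalence on two small made-up frames, `left` and `right`, whose columns are deliberately in a different order; the frame names and values are assumptions chosen for illustration.

```python
import pandas as pd

left = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
right = pd.DataFrame({'B': [10, 20], 'A': [30, 40]})  # same labels, different column order

# Each operator has a method counterpart that also accepts keywords such as fill_value.
same_add = (left + right).equals(left.add(right))
same_sub = (left - right).equals(left.sub(right))
same_mul = (left * right).equals(left.mul(right))
same_div = (left / right).equals(left.div(right))

total = left + right  # values are matched by column label, not by position
print(total)
print(same_add, same_sub, same_mul, same_div)
```

Column `'A'` of `total` holds 1 + 30 and 2 + 40 regardless of where `'A'` sits in each input.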

In [15]:
A = pd.DataFrame(rng.randint(0, 20, (2, 2)),
                 columns=list('AB'))
A

Unnamed: 0,A,B
0,11,19
1,2,4


In [17]:
B = pd.DataFrame(rng.randint(0, 10, (3, 3)),
                columns = list('BAC'))
B

Unnamed: 0,B,A,C
0,9,8,9
1,4,1,3
2,6,7,2


In [18]:
A + B
# Indices are aligned correctly irrespective of their order in the two objects
# indices in the result are sorted.

Unnamed: 0,A,B,C
0,19.0,28.0,
1,3.0,8.0,
2,,,


In [23]:
fill = A.stack().mean()  # stack all cells of A into a single Series, then take their mean
A.add(B, fill_value = fill)

Unnamed: 0,A,B,C
0,19.0,28.0,18.0
1,3.0,8.0,12.0
2,16.0,15.0,11.0


In [30]:
A.stack().mean()

9.0
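
To make the role of `stack()` explicit, here is a minimal sketch that rebuilds the frame `A` shown above from its printed values (typed in by hand, since the random state of the notebook is not reproducible here) and compares the overall mean with the per-column mean.

```python
import pandas as pd

# Same values as the frame A displayed above, written out explicitly.
A = pd.DataFrame([[11, 19], [2, 4]], columns=list('AB'))

stacked = A.stack()            # 1-D Series indexed by (row label, column label)
overall_mean = stacked.mean()  # mean over every cell of A

column_means = A.mean()        # by contrast: one mean per column

print(stacked)
print(overall_mean)            # 9.0
print(A.to_numpy().mean())     # same value via the underlying NumPy array
print(column_means)
```

Calling `A.mean()` directly would give one fill value per column, which is why the notes stack first to obtain a single scalar.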

### Ufuncs: Operations between DataFrame and Series
* By convention, operations between a DataFrame and a Series follow NumPy's broadcasting rules and operate row-wise by default.
* To operate column-wise, use the object methods and specify the *axis* keyword.
    * df.subtract(other, axis='columns', level=None, fill_value=None)
        * other : Series, DataFrame, or constant
        * axis : {0, 1, 'index', 'columns'}
            * For Series input, the axis on which to match the Series index
* Operations automatically align indices between the two elements.
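
A common use of the *axis* keyword is centering data. The sketch below, using the same values as the `QRST` frame in this section, subtracts column means with the default alignment and row means with `axis=0`; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[0, 3, 1, 7],
                   [3, 1, 5, 5],
                   [9, 3, 5, 1]], columns=list('QRST'))

col_means = df.mean(axis=0)  # one value per column, indexed Q..T
row_means = df.mean(axis=1)  # one value per row, indexed 0..2

# Default (axis='columns'): the Series index is matched against the column labels.
centered_cols = df - col_means

# axis=0: the Series index is matched against the row labels instead.
centered_rows = df.sub(row_means, axis=0)

print(np.allclose(centered_cols.sum(axis=0), 0))  # every column now sums to ~0
print(np.allclose(centered_rows.sum(axis=1), 0))  # every row now sums to ~0
```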

In [31]:
A = rng.randint(10, size=(3, 4))
A

array([[0, 3, 1, 7],
       [3, 1, 5, 5],
       [9, 3, 5, 1]])

In [32]:
A-A[0]

array([[ 0,  0,  0,  0],
       [ 3, -2,  4, -2],
       [ 9,  0,  4, -6]])

In [33]:
# By convention (NumPy broadcasting rules), the operation is applied row-wise by default
df = pd.DataFrame(A, columns = list('QRST'))
df

Unnamed: 0,Q,R,S,T
0,0,3,1,7
1,3,1,5,5
2,9,3,5,1


In [34]:
df - df.iloc[0]

Unnamed: 0,Q,R,S,T
0,0,0,0,0
1,3,-2,4,-2
2,9,0,4,-6


In [41]:
# For operating column-wise, use the object methods, while specifying the axis keyword.
# df.subtract(other, axis='columns', level=None, fill_value=None)
df.subtract(df['R'], axis = 0)

Unnamed: 0,Q,R,S,T
0,-3,0,-2,4
1,2,0,4,4
2,6,0,2,-2


In [44]:
# Operations will automatically align indices between the two elements
halfrow = df.iloc[0, ::2]
halfrow

Q    0
S    1
Name: 0, dtype: int32

In [45]:
df - halfrow

Unnamed: 0,Q,R,S,T
0,0.0,,0.0,
1,3.0,,4.0,
2,9.0,,4.0,


### Handling Missing Data: *null*, *NaN*, or *NA*
#### Trade-Offs in Missing Data Conventions
* Two general strategies:
    1. Using a *mask* that globally indicates missing values.
        * The mask might be an entirely separate Boolean array.
        * It may instead appropriate one bit in the data representation to locally indicate the null status of a value.
        * A separate mask requires allocating an additional Boolean array, which adds overhead in both storage and computation.
    2. Choosing a *sentinel value* that indicates a missing entry.
        * Could be some data-specific convention, e.g. -9999 or some rare bit pattern.
        * Could be a more global convention, e.g. indicating a missing floating-point value with NaN.
        * A sentinel reduces the range of valid values that can be represented and may require extra logic in CPU and GPU arithmetic.
        * Common special values like NaN are not available for all data types.
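
The two strategies can be contrasted in plain NumPy. This is only a sketch: the `-9999` marker and the array `raw` are made-up example data, and NumPy's masked arrays stand in for the mask approach in general.

```python
import numpy as np

raw = np.array([4, -9999, 7, 1])  # -9999 is a hypothetical "missing" marker

# Strategy 1: a separate Boolean mask, here via NumPy's masked-array module.
masked = np.ma.masked_array(raw, mask=(raw == -9999))
mask_mean = masked.mean()         # masked entries are skipped

# Strategy 2: a sentinel stored in the data itself; NaN forces a float dtype.
sentinel = np.where(raw == -9999, np.nan, raw)
sentinel_mean = np.nanmean(sentinel)

print(masked.dtype, mask_mean)        # the integer dtype is kept
print(sentinel.dtype, sentinel_mean)  # the data is upcast to float64
```

The mask keeps the integer dtype at the cost of a second array, while the NaN sentinel needs no extra storage but changes the dtype.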

#### Missing Data in Pandas
* Pandas uses sentinels for missing data, and further uses two already-existing Python null values: <mark> the special floating-point NaN value, and the Python None object.</mark>

* **None**: Pythonic missing data
    * The first sentinel value used by Pandas is **None**
    * **None** is a <mark>Python singleton object</mark>:
        * It cannot be used in an arbitrary NumPy/Pandas array, but only in arrays with data type 'object'.
        * This **dtype = object** means that the best common type representation NumPy could infer for the contents of the array is that they are Python objects.
        * Any operations on the data will be done at the Python level, with extra overhead.
        * Performing aggregations such as sum() or min() across an array with a None value will generally raise an error.

In [47]:
vals1 = np.array([1, None, 3, 4])
vals1

# This dtype = object means that the best common type representation np could infer 
# for the contents of the array is that they are Python objects.

array([1, None, 3, 4], dtype=object)

In [48]:
# Any operations on the data will be done at the Python level, with extra overhead
for dtype in ['object', 'int']:
    print('dtype =', dtype)
    %timeit np.arange(1E6, dtype = dtype).sum()
    print()

dtype = object
69 ms ± 2.44 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

dtype = int
2.17 ms ± 35.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)



In [49]:
# The use of Python objects in an array also means that if you perform aggregations
# like sum() or min() across an array with a None value, you will generally get an error
vals1.sum()
# TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

#### NaN: Missing numerical data
* **NaN**
    * acronym for *Not a Number*
    * a special floating-point value recognized by all systems that use the standard IEEE floating-point representation
        * <mark>no equivalent NaN value for integers, strings, or other types</mark>
    * it infects any other object it touches
        * the result of arithmetic with NaN will be another NaN
        * aggregates over the values are well defined (they do not raise an error) but not always useful.
            * NumPy provides some special aggregations that ignore these missing values
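
One practical consequence of the IEEE rules is that NaN never compares equal to anything, including itself, so it cannot be located with `==`. The following sketch (with an illustrative array `vals`) shows `np.isnan` as the reliable test.

```python
import numpy as np

vals = np.array([1.0, np.nan, 3.0, 4.0])

# An equality test finds nothing, because NaN != NaN.
eq_hits = (vals == np.nan).sum()

# np.isnan is the reliable way to build a mask of missing entries.
nan_mask = np.isnan(vals)

print(eq_hits)                # 0
print(nan_mask)
print(vals[~nan_mask].sum())  # sum of the non-missing entries: 8.0
```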

In [52]:
vals2 = np.array([1, np.nan, 3, 4])
vals2.dtype

dtype('float64')

In [53]:
# NaN is a bit like a data virus—it infects any other object it touches.
# Regardless of the operation, the result of arithmetic with NaN will be another NaN
1 + np.nan

nan

In [55]:
# NaN is a bit like a data virus—it infects any other object it touches.
# Regardless of the operation, the result of arithmetic with NaN will be another NaN
0 * np.nan

nan

In [56]:
# aggregates over the values are well defined but not always useful.
vals2.sum(), vals2.min(), vals2.max()

(nan, nan, nan)

In [57]:
# NumPy provides some special aggregations that ignore these missing values
np.nansum(vals2), np.nanmin(vals2), np.nanmax(vals2)

(8.0, 1.0, 4.0)

#### NaN and None in Pandas
* Pandas handles NaN and None nearly interchangeably, converting between them where appropriate.
    * For types that don't have an available sentinel value, Pandas automatically type-casts when NA values are present
<img src = "files/pandas_na.PNG" width = 300>
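
The type-casting behavior described above can be checked directly. This sketch builds three small Series with a missing entry (the names `ints`, `bools`, and `floats` are illustrative) and prints the dtype Pandas settles on for each.

```python
import pandas as pd

ints = pd.Series([1, 2, None])          # integers + missing -> float64, None becomes NaN
bools = pd.Series([True, False, None])  # booleans + missing -> object
floats = pd.Series([1.5, None])         # already floating point -> stays float64

for name, s in [('ints', ints), ('bools', bools), ('floats', floats)]:
    print(name, s.dtype, s.isnull().tolist())
```

In all three cases `isnull()` flags the missing entry, whether it is stored as NaN or as None.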

In [58]:
# Pandas handles NaN and None nearly interchangeably, converting between them where appropriate.
pd.Series([1, np.nan, 2, None])

0    1.0
1    NaN
2    2.0
3    NaN
dtype: float64

In [60]:
# For types that don't have an available sentinel value,
# Pandas automatically type-casts when NA values are present.
# An integer array is automatically upcast to floating point if one of its values is set to np.nan.
x = pd.Series(range(2), dtype = int)
x

0    0
1    1
dtype: int32

In [61]:
x[0] = None
x  # the None is automatically converted to a NaN value

0    NaN
1    1.0
dtype: float64

### Operating on Null Values
* Pandas treats None and NaN as essentially interchangeable for indicating missing or null values.
* Several useful methods for detecting, removing, and replacing null values
    * data.isnull(): Return a Boolean same-sized object indicating if the values are NA.
    * data.notnull(): Return a Boolean same-sized object indicating if the values are not NA.
    * data.dropna(axis=0, inplace=False, **kwargs): Return a Series with the null values removed.
    * data.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs): Return a copy of the data with the null values filled in.
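
Because `isnull()` and `notnull()` return Boolean Series, they can be used directly as an index mask. A minimal sketch, reusing the same four values as the `data` Series in this section:

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 'hello', None])

mask = data.notnull()  # Boolean Series aligned with data
kept = data[mask]      # selects the same rows that data.dropna() returns

print(kept)
print(int(data.isnull().sum()))  # number of missing entries: 2
```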

In [2]:
data = pd.Series([1, np.nan, 'hello', None])

In [65]:
data.isnull()

0    False
1     True
2    False
3     True
dtype: bool

In [67]:
data.notnull()

0     True
1    False
2     True
3    False
dtype: bool

In [69]:
data.isnull() & data.notnull()

0    False
1    False
2    False
3    False
dtype: bool

In [4]:
data.dropna()

0        1
2    hello
dtype: object

In [3]:
data.fillna?

* Dropping null values
    * We can't drop single values from a df; we can only drop full rows or full columns.
    * df.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
        * axis: {0 or 'index', 1 or 'columns'}, or tuple/list thereof. 
            * Pass tuple or list to drop on multiple axes.
        * how: {'any', 'all'}
            * any: if any NA values are present, drop that **label**
            * all: if all values are NA, drop that **label**
        * thresh: int, default None
            * int value: require that many non-NA values
        * subset: array-like
            * Labels along the other axis to consider, e.g. if you're dropping rows, these would be a list of columns to include.
        * inplace: boolean, default False
            * If True, do operation inplace and return None
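
The `subset` parameter is easiest to see in a small example. The sketch below uses the same values as the `ABCD` frame in this section and restricts the NaN check to chosen columns; the result names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5]],
                  columns=list('ABCD'))

# Drop a row only if it has a NaN in column 'A'; NaNs in other columns are ignored.
rows_with_a = df.dropna(subset=['A'])

# Combine subset with how='all': drop rows where both 'A' and 'B' are NaN.
rows_with_a_or_b = df.dropna(subset=['A', 'B'], how='all')

print(rows_with_a)
print(rows_with_a_or_b)
```

Without `subset`, the all-NaN column `'C'` would cause every row to be dropped under the default `how='any'`.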

In [22]:
df = pd.DataFrame([[np.nan, 2, np.nan, 0], [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5]],
                   columns=list('ABCD'))
df

Unnamed: 0,A,B,C,D
0,,2.0,,0
1,3.0,4.0,,1
2,,,,5


In [7]:
# Drop the columns where all elements are nan:
df.dropna(axis = 1, how = 'all')

Unnamed: 0,A,B,D
0,,2.0,0
1,3.0,4.0,1
2,,,5


In [9]:
# Drop the columns where any of the elements is NaN:
df.dropna(axis = 1, how = 'any')

Unnamed: 0,D
0,0
1,1
2,5


In [10]:
# Drop the rows where all elements are NaN (no such row here, so nothing is dropped):
df.dropna(axis = 0, how = 'all')

Unnamed: 0,A,B,C,D
0,,2.0,,0
1,3.0,4.0,,1
2,,,,5


In [11]:
# Keep only the rows with at least 2 non-NA values:
df.dropna(axis = 0, thresh = 2)

Unnamed: 0,A,B,C,D
0,,2.0,,0
1,3.0,4.0,,1


* Filling null values
    * Replace NA values with a valid value.
    * Using the **isnull**() method as a mask.
    * fillna() method: return a copy of the array with the null values replaced.
    * df.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)
        * value: scalar, dict, Series, or df
        * method: {'backfill', 'bfill', 'pad', 'ffill', None}, default None
            * Method to use for filling holes in reindexed Series
            * pad/ffill: propagate last valid observation forward to next valid
            * backfill/bfill: use NEXT valid observation to fill gap
        * axis: {0 or 'index', 1 or 'columns'}
        * inplace: boolean, default False
            * If True, fill in place. Note: this will modify any other views on this object (e.g. a no-copy slice for a column in a DataFrame).
        * limit: int, default None
            * If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill.
            * In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled.
            * If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled.
            * Must be greater than 0 if not None.
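
A short sketch of `limit` and `axis` in practice, on made-up data. It uses the dedicated `ffill()` method, which behaves like `fillna(method='ffill')`; recent pandas releases deprecate the `method` keyword of `fillna` in favor of `ffill()` and `bfill()`, so this form is assumed to be the safer choice on current versions.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# Forward fill, but fill at most one consecutive NaN in the gap.
partial = s.ffill(limit=1)        # 1.0, 1.0, NaN, NaN, 5.0

# Fill with a statistic of the data itself, here the mean of the valid entries.
mean_filled = s.fillna(s.mean())  # NaNs become 3.0

df = pd.DataFrame([[np.nan, 2.0, np.nan],
                   [3.0, np.nan, 1.0]], columns=list('ABC'))
along_rows = df.ffill(axis=1)     # propagate values left to right within each row

print(partial)
print(mean_filled)
print(along_rows)
```

When filling along rows, a leading NaN (row 0, column `'A'`) stays NaN because there is no earlier value to propagate.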
            

In [13]:
df

Unnamed: 0,A,B,C,D
0,,2.0,,0
1,3.0,4.0,,1
2,,,,5


In [14]:
# Replace all NaN elements with 0s.
df.fillna(0)

Unnamed: 0,A,B,C,D
0,0.0,2.0,0.0,0
1,3.0,4.0,0.0,1
2,0.0,0.0,0.0,5


In [16]:
# forward fill
df.fillna(method = 'ffill')

Unnamed: 0,A,B,C,D
0,,2.0,,0
1,3.0,4.0,,1
2,3.0,4.0,,5


In [17]:
# backward fill
df.fillna(method = 'bfill')

Unnamed: 0,A,B,C,D
0,3.0,2.0,,0
1,3.0,4.0,,1
2,,,,5


In [18]:
# Replace all NaN elements in columns 'A', 'B', 'C', and 'D' with 0, 1, 2, and 3 respectively.
values = {'A':0, 'B':1, 'C':2, 'D': 3}
df.fillna(value = values)

Unnamed: 0,A,B,C,D
0,0.0,2.0,2.0,0
1,3.0,4.0,2.0,1
2,0.0,1.0,2.0,5


In [19]:
# Only replace the first NaN element in each column.
df.fillna(value = values, limit = 1)

Unnamed: 0,A,B,C,D
0,0.0,2.0,2.0,0
1,3.0,4.0,,1
2,,1.0,,5


In [21]:
df.fillna(0, inplace = True)
df

Unnamed: 0,A,B,C,D
0,0.0,2.0,0.0,0
1,3.0,4.0,0.0,1
2,0.0,0.0,0.0,5
