# Pandas 

Playing along with Pandas (sadly not the furry ones)

In [1]:
import numpy as np
import pandas as pd



## Series, DataFrames and Indexes

In [9]:
S = pd.Series(np.linspace(0,1,5))
S

0    0.00
1    0.25
2    0.50
3    0.75
4    1.00
dtype: float64

In [25]:
S.values, S.index

(array([0.  , 0.25, 0.5 , 0.75, 1.  ]), RangeIndex(start=0, stop=5, step=1))

- Series objects can be extended, e.g. `S[5] = 1.25` would append a new entry to the series above
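A minimal sketch of the extension mentioned above, assuming the same `S` as in the earlier cell. Assigning to a label that does not yet exist enlarges the Series in place:

```python
import numpy as np
import pandas as pd

S = pd.Series(np.linspace(0, 1, 5))

# Label 5 is not in the index yet, so this assignment appends a new entry.
S[5] = 1.25

new_length = len(S)        # one more element than before
last_value = S.iloc[-1]    # the value that was just added
```

Using `S.loc[5] = 1.25` should behave the same way and makes the label-based intent explicit.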

In [13]:
population_dict = {'California': 38332521, 'Texas': 26448193,
                   'New York': 19651127, 'Florida': 19552860,
                   'Illinois': 12882135}
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}

D = pd.DataFrame({'population': population_dict, 'area': area_dict})
D

            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995


In [14]:
D.index, D.columns

(Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object'),
 Index(['population', 'area'], dtype='object'))

In [27]:
I1 = pd.Index(range(5))
I2 = pd.Index([c for c in "abcde"])
I3 = pd.Index(range(3,8))

I1, I2, I3 

(RangeIndex(start=0, stop=5, step=1),
 Index(['a', 'b', 'c', 'd', 'e'], dtype='object'),
 RangeIndex(start=3, stop=8, step=1))

Indexes behave like arrays, except that they are immutable: `I1[2] = 3` would raise a `TypeError`. Indexes also support set operations such as union, intersection and symmetric difference, as shown below
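The immutability described above can be checked directly. This is a small sketch that reuses the name `I1` from the earlier cell; the `mutated` flag is only an illustrative helper:

```python
import pandas as pd

I1 = pd.Index(range(5))

try:
    I1[2] = 3          # item assignment is not supported on an Index
    mutated = True
except TypeError:
    mutated = False    # pandas refuses the in-place change

# The index is unchanged after the failed assignment.
unchanged = list(I1)
```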

In [18]:
I1.intersection(I3)

RangeIndex(start=3, stop=5, step=1)

In [22]:
I1.union(I3).union(I2) # union, intersection etc. can be chained

Index([0, 1, 2, 3, 4, 5, 6, 7, 'a', 'b', 'c', 'd', 'e'], dtype='object')

In [21]:
I1.symmetric_difference(I3)

Int64Index([0, 1, 2, 5, 6, 7], dtype='int64')

### `loc` and `iloc`

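When a Series has an integer index, it is ambiguous whether `S[1]` means a position (the implicit, Python-style index) or a label (the explicit Pandas index). To remove this ambiguity, Pandas provides the `loc` and `iloc` accessors: `loc` always refers to the explicit index, whereas `iloc` always refers to the implicit index. (The older `ix` accessor mixed the two and was removed in pandas 1.0.)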

In [28]:
S2 = pd.Series([c for c in "abcde"], index = [1,2,4,5,6])
S2

1    a
2    b
4    c
5    d
6    e
dtype: object

In [31]:
S2[1:4] # note: integer slicing is implicit (positional)

2    b
4    c
5    d
dtype: object

In [33]:
S2[4] # but access is explicit! 

'c'

In [34]:
S2.iloc[4] # to avoid this confusion, use loc and iloc

'e'
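The same distinction applies to slices. The sketch below, assuming the `S2` defined above, shows that a `loc` slice is label-based and includes its end label, while an `iloc` slice is positional and excludes its end position:

```python
import pandas as pd

S2 = pd.Series(list("abcde"), index=[1, 2, 4, 5, 6])

# Label-based slice: labels 1 through 4, end label included.
by_label = S2.loc[1:4]

# Position-based slice: positions 1, 2, 3, end position excluded.
by_position = S2.iloc[1:4]

label_values = list(by_label)        # values at labels 1, 2, 4
position_values = list(by_position)  # values at positions 1, 2, 3
```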

DataFrames also allow index-based selection, very similar to numpy arrays

In [38]:
D.loc['California', 'area'] # note that you HAVE to use loc here;
# D['California', 'area'] without loc raises a KeyError

423967

In [44]:
D.values

array([[38332521,   423967],
       [26448193,   695662],
       [19651127,   141297],
       [19552860,   170312],
       [12882135,   149995]])

Note that plain _indexing_ (e.g. `D['area']`) refers to columns, while _slicing_ (e.g. `D[1:3]`) refers to rows. Direct masking operations (e.g. `D[D['area'] > 200000]`) are also interpreted row-wise rather than column-wise
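A short sketch of the three cases, rebuilding the `D` frame from the earlier cell. The 200000 threshold is an arbitrary value chosen for illustration:

```python
import pandas as pd

D = pd.DataFrame({
    'population': {'California': 38332521, 'Texas': 26448193,
                   'New York': 19651127, 'Florida': 19552860,
                   'Illinois': 12882135},
    'area': {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995},
})

column = D['area']                    # indexing with a name selects a column
row_slice = D[1:3]                    # slicing selects rows (by position here)
label_slice = D['Texas':'Florida']    # label slices also select rows, end included
masked = D[D['area'] > 200000]        # a boolean mask filters rows

sliced_rows = list(row_slice.index)
labelled_rows = list(label_slice.index)
masked_rows = list(masked.index)
```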

### Index Alignment

When a binary operation is applied to two Series or DataFrame objects, Pandas first aligns their indices, and the result carries the union of the operands' indices. Any label that is missing from either operand (i.e. outside the intersection of the two indices) gets `NaN` in the result, since arithmetic involving `NaN` yields `NaN`, just as in NumPy.

In [64]:
A = pd.Series(range(5))
B = pd.Series(range(3,8), index=range(3,8))

In [65]:
A+B # NaN behaviour

0    NaN
1    NaN
2    NaN
3    6.0
4    8.0
5    NaN
6    NaN
7    NaN
dtype: float64

In [67]:
A.add(B, fill_value=0) # if you want NaN's filled with 0

0    0.0
1    1.0
2    2.0
3    6.0
4    8.0
5    5.0
6    6.0
7    7.0
dtype: float64

### More on NaN

Pandas uses the floating-point `NaN` to mark absent values. As a result, an integer array containing a `NaN` is upcast to floating point, _unless declared otherwise_. Pandas also provides nullable integer dtypes (e.g. `Int64`) with their own missing-value marker (`pd.Int64Dtype.na_value`, i.e. `pd.NA`), but these must be requested explicitly; by default, a `NaN` converts the data to float.

In [78]:
pd.Series([1,2,np.nan,4]) # automatically upcast to float64

0    1.0
1    2.0
2    NaN
3    4.0
dtype: float64

In [79]:
pd.Series([1,2,None,4]) # smart change: None to NaN

0    1.0
1    2.0
2    NaN
3    4.0
dtype: float64

In [80]:
pd.Series([1,2,None,4], dtype='Int32') 
# keeping it int, but accounting for NaN

0       1
1       2
2    <NA>
3       4
dtype: Int32

In [81]:
_80.dropna()

0    1
1    2
3    4
dtype: Int32

In [82]:
_79.fillna(3)

0    1.0
1    2.0
2    3.0
3    4.0
dtype: float64

In [83]:
_80.isna()

0    False
1    False
2     True
3    False
dtype: bool

In [85]:
_80.isnull() # equivalent to isna

0    False
1    False
2     True
3    False
dtype: bool

In [112]:
A = pd.DataFrame(np.random.randint(0,50,(5,8)))
A = A[A>5]
A # has some random NaN values

    0   1     2     3     4     5     6     7
0  47  40   NaN  18.0  10.0  46.0  10.0  21.0
1   6  38  38.0  37.0  28.0   NaN  49.0  49.0
2  40  34   8.0  18.0   NaN  31.0  24.0   NaN
3  18  33   NaN  19.0   9.0  19.0   NaN  40.0
4  45  34  45.0   NaN  13.0   6.0   6.0  46.0


In [123]:
A.dropna(axis='rows', thresh=7) # keep only rows with at least 7 non-NaN values

    0   1     2     3     4     5     6     7
0  47  40   NaN  18.0  10.0  46.0  10.0  21.0
1   6  38  38.0  37.0  28.0   NaN  49.0  49.0
4  45  34  45.0   NaN  13.0   6.0   6.0  46.0
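Since the frame above is random, here is a small deterministic sketch of the main `dropna` options. The frame `F` and its values are made up for illustration:

```python
import numpy as np
import pandas as pd

F = pd.DataFrame({'x': [1.0, np.nan, 3.0],
                  'y': [np.nan, np.nan, 6.0],
                  'z': [7.0, 8.0, 9.0]})

any_rows = F.dropna()                  # default how='any': drop rows with any NaN
all_rows = F.dropna(how='all')         # drop only rows that are entirely NaN
clean_cols = F.dropna(axis='columns')  # drop columns containing any NaN
thresh_rows = F.dropna(thresh=2)       # keep rows with at least 2 non-NaN values

kept_any = list(any_rows.index)
kept_all = list(all_rows.index)
kept_cols = list(clean_cols.columns)
kept_thresh = list(thresh_rows.index)
```

Row 1 has only one non-NaN value (`z`), so it is the one dropped by `thresh=2`, while no row is entirely empty and `how='all'` keeps everything.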
