# Pandas Objects
## What is Pandas?
>Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

[Go through Pandas Introduction](https://pandas.pydata.org/docs/getting_started/index.html#getting-started)

Note: This lecture is based on *[Chapter 3 "Data Manipulation with Pandas"](https://jakevdp.github.io/PythonDataScienceHandbook/03.01-introducing-pandas-objects.html)* of Jake VanderPlas' *Python Data Science Handbook*.

## Fundamental Pandas Data Structures
### Series
- One-dimensional array of indexed data
- Can be created from a list or array

In [51]:
import pandas as pd
import numpy as np

In [6]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

- A ``Series`` wraps a sequence of values and a sequence of indices.
- They are accessed with the ``values`` and ``index`` attributes, respectively.

In [8]:
data.values

array([0.25, 0.5 , 0.75, 1.  ])

In [10]:
type(data.values)

numpy.ndarray

As shown above, the ``values`` of a ``Series`` is a NumPy array.

The ``index`` is an array-like object of type ``pd.Index``.

In [13]:
data.index

RangeIndex(start=0, stop=4, step=1)

In [12]:
type(data.index)

pandas.core.indexes.range.RangeIndex

#### Accessing Data
Pandas ``Series`` data can be accessed by index using square-bracket notation.

In [19]:
data[1]

0.5

In [21]:
data[1:3]

1    0.50
2    0.75
dtype: float64

#### NumPy array vs. Pandas Series
- A NumPy array has an implicitly defined integer index.
- A Pandas ``Series`` has an explicitly defined index.
  - The index need not be an integer. It can be of any desired type.

In [23]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [25]:
data['b']

0.5

A Pandas ``Series`` can be thought of as a specialization of a Python dictionary: it maps typed index keys to typed values.

In [30]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64
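
The dictionary analogy can be sketched as follows. The data repeats the population example above; the variable names `california` and `subset` are only illustrative. Unlike a plain dictionary, a ``Series`` also accepts label slices.

```python
import pandas as pd

# Same data as the population example above.
population = pd.Series({'California': 38332521,
                        'Texas': 26448193,
                        'New York': 19651127,
                        'Florida': 19552860,
                        'Illinois': 12882135})

# Dictionary-style lookup by key.
california = population['California']

# Unlike a plain dict, a Series also supports slicing by label;
# both endpoints of a label slice are included.
subset = population['California':'New York']

print(california)
print(subset)
```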

#### Constructing Series Objects

In [33]:
pd.Series([2, 4, 6])

0    2
1    4
2    6
dtype: int64

In [36]:
pd.Series(data, index=[100, 200, 300])

100   NaN
200   NaN
300   NaN
dtype: float64

In [38]:
pd.Series({2:'a', 1:'b', 3:'c'})

2    a
1    b
3    c
dtype: object

In [39]:
pd.Series({2:'a', 1:'b', 3:'c'}, index=[3, 2])

3    c
2    a
dtype: object
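
The all-``NaN`` result of ``pd.Series(data, index=[100, 200, 300])`` above is presumably because ``data`` was still the letter-indexed ``Series`` at that point, so none of the requested labels exist in it. The sketch below, using the illustrative names `labeled`, `reindexed` and `constant`, contrasts that case with passing a scalar, which is repeated to fill the index.

```python
import pandas as pd

labeled = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

# Passing an existing Series plus an index selects by label:
# labels that are missing from the source become NaN.
reindexed = pd.Series(labeled, index=['a', 'c', 'z'])

# Passing a scalar repeats it for every index label.
constant = pd.Series(5, index=[100, 200, 300])

print(reindexed)
print(constant)
```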

### DataFrame
- ``Series`` is one-dimensional 
- ``DataFrame`` is two-dimensional
  - A sequence of aligned ``Series`` objects sharing the same index.

In [40]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64

In [41]:
states = pd.DataFrame({'population': population,
                       'area': area})
states

|            | population | area   |
|------------|------------|--------|
| California | 38332521   | 423967 |
| Texas      | 26448193   | 695662 |
| New York   | 19651127   | 141297 |
| Florida    | 19552860   | 170312 |
| Illinois   | 12882135   | 149995 |


Like the ``Series`` object, a Pandas ``DataFrame`` has an ``index`` attribute that gives access to the index labels.
  - A ``DataFrame`` is a generalization of a two-dimensional NumPy array with both row and column indices for accessing the data.

In [42]:
states.index

Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')

A ``DataFrame`` also has a ``columns`` attribute.
  - It is an ``Index`` object holding the column labels.

In [43]:
states.columns

Index(['population', 'area'], dtype='object')

``DataFrame`` maps a column name to a ``Series`` of column data. 

In [45]:
states['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64
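
Selecting by a single key is a point where a ``DataFrame`` behaves differently from a two-dimensional NumPy array. The sketch below uses made-up values and the hypothetical column names `col0` and `col1` to index the same data both ways.

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 2], [3, 4], [5, 6]])
df = pd.DataFrame(arr, columns=['col0', 'col1'])

first_row = arr[0]       # NumPy: a single index selects a row
first_col = df['col0']   # DataFrame: a single key selects a column
same_row = df.iloc[0]    # DataFrame: iloc selects a row by position

print(first_row)
print(first_col)
print(same_row)
```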

Note:
>In a two-dimensional NumPy array, ``data[0]`` returns the first row. For a ``DataFrame``, ``data['col0']`` returns the first column.

### Constructing DataFrame Objects

In [46]:
pd.DataFrame(population, columns=['population'])

|            | population |
|------------|------------|
| California | 38332521   |
| Texas      | 26448193   |
| New York   | 19651127   |
| Florida    | 19552860   |
| Illinois   | 12882135   |


In [47]:
data = [{'a': i, 'b': 2 * i}
        for i in range(3)]
pd.DataFrame(data)

|   | a | b |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 1 | 2 |
| 2 | 2 | 4 |


In [48]:
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

|   | a   | b | c   |
|---|-----|---|-----|
| 0 | 1.0 | 2 | NaN |
| 1 | NaN | 3 | 4.0 |


A ``DataFrame`` can be constructed from a dictionary of ``Series`` objects.

In [49]:
pd.DataFrame({'population': population,
              'area': area})

|            | population | area   |
|------------|------------|--------|
| California | 38332521   | 423967 |
| Texas      | 26448193   | 695662 |
| New York   | 19651127   | 141297 |
| Florida    | 19552860   | 170312 |
| Illinois   | 12882135   | 149995 |


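A ``DataFrame`` can be created from a two-dimensional NumPy array with any specified column and index names. If ``columns`` or ``index`` is omitted, an integer index is used for it by default.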

In [53]:
pd.DataFrame(np.random.rand(3, 2),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])

|   | foo      | bar      |
|---|----------|----------|
| a | 0.114741 | 0.437385 |
| b | 0.670254 | 0.193729 |
| c | 0.258083 | 0.422283 |
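
The example above names both axes. To illustrate the default behavior, the following sketch builds a ``DataFrame`` from the same kind of array with and without labels. The names `values`, `default_labels` and `named` are only illustrative.

```python
import numpy as np
import pandas as pd

values = np.arange(6).reshape(3, 2)

# No labels given: rows and columns both get a default integer index.
default_labels = pd.DataFrame(values)

# Labels given explicitly for both axes.
named = pd.DataFrame(values, columns=['foo', 'bar'], index=['a', 'b', 'c'])

print(default_labels)
print(named)
```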


Creating a ``DataFrame`` from a NumPy structured array:

In [54]:
A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
A

array([(0, 0.), (0, 0.), (0, 0.)], dtype=[('A', '<i8'), ('B', '<f8')])

In [55]:
pd.DataFrame(A)

|   | A | B   |
|---|---|-----|
| 0 | 0 | 0.0 |
| 1 | 0 | 0.0 |
| 2 | 0 | 0.0 |


### Index
- Can be thought of as an *immutable array* or an *ordered set*
- Operates like an immutable array
- Has many attributes similar to those of NumPy arrays

In [56]:
ind = pd.Index([2, 3, 5, 7, 11])
ind

Int64Index([2, 3, 5, 7, 11], dtype='int64')

In [58]:
ind[1]

3

In [61]:
ind[::2]

Int64Index([2, 5, 11], dtype='int64')

In [62]:
print(ind.size, ind.shape, ind.ndim, ind.dtype)

5 (5,) 1 int64


In [63]:
ind[1] = 0

TypeError: Index does not support mutable operations

>The Index object follows many of the conventions used by Python's built-in set data structure, so that unions, intersections, differences, and other combinations can be computed in a familiar way:

In [64]:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

In [65]:
indA.intersection(indB)  # intersection

Int64Index([3, 5, 7], dtype='int64')

In [66]:
indA.union(indB)  # union

Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')

In [67]:
indA.symmetric_difference(indB)  # symmetric difference

Int64Index([1, 2, 9, 11], dtype='int64')

## Data Indexing and Selection

### Data Selection in Series
- A ``Series`` object acts like a one-dimensional NumPy array, and it also acts like a Python dictionary.

#### Series as Dictionary
A ``Series`` object provides a mapping from a collection of keys to a collection of values.

In [68]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [69]:
data['b']

0.5

Python expressions used for dictionaries can be used on Pandas ``Series`` objects.

In [70]:
'a' in data

True

In [71]:
data.keys()

Index(['a', 'b', 'c', 'd'], dtype='object')

In [72]:
list(data.items())

[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

In [75]:
data['e'] = 1.25
data

a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64

#### Series as one-dimensional array
A ``Series`` provides the same element-access mechanisms as a NumPy array: slicing, masking, and fancy indexing.

In [83]:
# slicing by explicit index
data['a':'c']

a    0.25
b    0.50
c    0.75
dtype: float64

**NOTE**: For slicing
- With explicit indexing (e.g. ``data['a':'c']``), the final index is included in the slice.
- With implicit indexing (e.g. ``data[0:2]``), the final index is NOT included in the slice.

In [79]:
# slicing by implicit integer index
data[0:2]

a    0.25
b    0.50
dtype: float64

In [81]:
# masking
data[(data > 0.3) & (data < 0.8)]

b    0.50
c    0.75
dtype: float64

In [82]:
# fancy indexing
data[['a', 'e']]

a    0.25
e    1.25
dtype: float64

### Indexers: loc, iloc, and ix

Possible source of confusion:
- If a ``Series`` has an explicit integer index, an indexing operation like ``data[1]`` will use the explicit indices.
- A slicing operation like ``data[1:3]`` will use the implicit Python-style index.

> - Use loc[] to choose rows and columns by label.
> - Use iloc[] to choose rows and columns by position.
> - Explicitly designate both rows and columns, even if it's with ":"

For a more detailed discussion, see [this article](https://medium.com/dunder-data/selecting-subsets-of-data-in-pandas-6fcd0170be9c).

In [85]:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data

1    a
3    b
5    c
dtype: object

In [88]:
# explicit index when indexing
data[1]

'a'

In [90]:
# implicit index when slicing
data[1:3]

3    b
5    c
dtype: object

Because of this potential confusion, Pandas provides special indexer *attributes*. Note: they are not methods, but attributes that expose a particular slicing interface to the data in the ``Series``.

#### ``loc`` attribute
- Allows indexing and slicing that always references the explicit index

In [92]:
data.loc[1]

'a'

In [94]:
data.loc[1:3]

1    a
3    b
dtype: object

#### ``iloc`` attribute
- Allows indexing and slicing that always references the implicit Python-style index

In [96]:
data.iloc[1]

'b'

In [98]:
data.iloc[1:3]

3    b
5    c
dtype: object

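#### ``ix`` attribute
- A hybrid of the two
- For a ``Series`` object, it was equivalent to standard ``[]``-based indexing.
- It was deprecated and then removed in Pandas 1.0, so it is not available in current versions.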

>One guiding principle of Python code is that "explicit is better than implicit." The explicit nature of loc and iloc make them very useful in maintaining clean and readable code; especially in the case of integer indexes, I recommend using these both to make code easier to read and understand, and to prevent subtle bugs due to the mixed indexing/slicing convention.


### Data Selection in DataFrame
#### DataFrame as a dictionary

In [99]:
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop})
data

|            | area   | pop      |
|------------|--------|----------|
| California | 423967 | 38332521 |
| Texas      | 695662 | 26448193 |
| New York   | 141297 | 19651127 |
| Florida    | 170312 | 19552860 |
| Illinois   | 149995 | 12882135 |


Dictionary-style indexing with the column name accesses the individual ``Series`` that make up the columns.

In [100]:
data['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

Attribute-style access also works for column names that are strings.

Note: Accessing a ``Series`` this way is discouraged, because it fails when a column name clashes with an existing ``DataFrame`` method or attribute.

In [101]:
data.area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

In [102]:
data.area is data['area']

True

In [103]:
data.pop is data['pop']

False

In [106]:
# data.pop is a built-in DataFrame method
type(data.pop)

method

In [105]:
type(data['pop'])

pandas.core.series.Series
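
A second pitfall of attribute-style access concerns assignment. The following sketch uses a trimmed two-row version of the data and a hypothetical attribute name `ratio` to show that assigning through an attribute does not create a column, while dictionary-style assignment does.

```python
import warnings
import pandas as pd

df = pd.DataFrame({'area': [423967, 695662], 'pop': [38332521, 26448193]},
                  index=['California', 'Texas'])

# Assigning through a new attribute name sets a plain Python attribute;
# it does not add a column (pandas emits a UserWarning, silenced here).
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    df.ratio = df['pop'] / df['area']
created_by_attribute = 'ratio' in df.columns

# Dictionary-style assignment does create the column.
df['density'] = df['pop'] / df['area']
created_by_key = 'density' in df.columns

print(created_by_attribute, created_by_key)
```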

In [107]:
# Modifying DataFrame objects using the dictionary style
data['density'] = data['pop'] / data['area']
data

|            | area   | pop      | density    |
|------------|--------|----------|------------|
| California | 423967 | 38332521 | 90.413926  |
| Texas      | 695662 | 26448193 | 38.01874   |
| New York   | 141297 | 19651127 | 139.076746 |
| Florida    | 170312 | 19552860 | 114.806121 |
| Illinois   | 149995 | 12882135 | 85.883763  |


NOTE: The above example shows an element-by-element arithmetic operation between columns of the ``DataFrame`` object.

#### DataFrame as two-dimensional array

The raw underlying data array can be accessed using the ``values`` attribute.

In [109]:
data.values

array([[4.23967000e+05, 3.83325210e+07, 9.04139261e+01],
       [6.95662000e+05, 2.64481930e+07, 3.80187404e+01],
       [1.41297000e+05, 1.96511270e+07, 1.39076746e+02],
       [1.70312000e+05, 1.95528600e+07, 1.14806121e+02],
       [1.49995000e+05, 1.28821350e+07, 8.58837628e+01]])

[Transposing](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.T.html#pandas.DataFrame.T) the full ``DataFrame`` swaps rows and columns.

In [111]:
data.T

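|         | California | Texas      | New York   | Florida    | Illinois   |
|---------|------------|------------|------------|------------|------------|
| area    | 423967.0   | 695662.0   | 141297.0   | 170312.0   | 149995.0   |
| pop     | 38332520.0 | 26448190.0 | 19651130.0 | 19552860.0 | 12882140.0 |
| density | 90.41393   | 38.01874   | 139.0767   | 114.8061   | 85.88376   |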


For the complete list of Pandas ``DataFrame`` attributes and methods, see the [API reference](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html).

Passing a single index to the underlying array accesses a row of the ``DataFrame``.

In [112]:
data.values[0]

array([4.23967000e+05, 3.83325210e+07, 9.04139261e+01])
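
Going through ``values`` discards the row and column labels. As a sketch of an alternative, assuming a trimmed three-row version of the data above, a row can also be selected with the ``iloc`` or ``loc`` indexers, which return a ``Series`` labeled by column name.

```python
import pandas as pd

data = pd.DataFrame({'area': [423967, 695662, 141297],
                     'pop': [38332521, 26448193, 19651127]},
                    index=['California', 'Texas', 'New York'])

row_by_position = data.iloc[0]          # first row, by integer position
row_by_label = data.loc['California']   # same row, by index label

print(row_by_position)
print(row_by_label)
```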

Passing a single "index" to a ``DataFrame`` accesses a column.

In [113]:
data['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

For array-style indexing, Pandas provides ``loc``, ``iloc``, and ``ix`` indexers.

In [114]:
data.iloc[:3, :2]

|            | area   | pop      |
|------------|--------|----------|
| California | 423967 | 38332521 |
| Texas      | 695662 | 26448193 |
| New York   | 141297 | 19651127 |


In [115]:
data.loc[:'Illinois', :'pop']

|            | area   | pop      |
|------------|--------|----------|
| California | 423967 | 38332521 |
| Texas      | 695662 | 26448193 |
| New York   | 141297 | 19651127 |
| Florida    | 170312 | 19552860 |
| Illinois   | 149995 | 12882135 |


The ``ix`` indexer allowed a hybrid of the two.

NOTE: In newer versions of Pandas (1.0 and later), ``ix`` and several other deprecated features have been removed, so the call below raises an ``AttributeError``. See the [Pandas 1.0.0 release notes](https://pandas.pydata.org/pandas-docs/version/1.0.0/whatsnew/v1.0.0.html#removal-of-prior-version-deprecations-changes) for more information.

In [117]:
data.ix[:3, :'pop']

AttributeError: 'DataFrame' object has no attribute 'ix'
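
Since ``ix`` is gone, a mixed selection (rows by position, columns by label) has to be expressed with ``loc`` and ``iloc``. The sketch below shows two possible ways to do this on a trimmed version of the data with rounded, illustrative density values. These are suggestions, not the only options.

```python
import pandas as pd

data = pd.DataFrame({'area': [423967, 695662, 141297, 170312],
                     'pop': [38332521, 26448193, 19651127, 19552860],
                     'density': [90.41, 38.02, 139.08, 114.81]},
                    index=['California', 'Texas', 'New York', 'Florida'])

# Option 1: turn the positions into labels, then use loc for both axes.
by_labels = data.loc[data.index[:3], :'pop']

# Option 2: chain the two indexers, positions first and labels second.
chained = data.iloc[:3].loc[:, :'pop']

print(by_labels)
```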

NumPy-style data access patterns also work with these indexers. For example, the ``loc`` indexer can combine masking and fancy indexing.

In [118]:
data.loc[data.density > 100, ['pop', 'density']]

|          | pop      | density    |
|----------|----------|------------|
| New York | 19651127 | 139.076746 |
| Florida  | 19552860 | 114.806121 |


Modifying values using indexing conventions

In [119]:
data.iloc[0, 2] = 90
data

|            | area   | pop      | density    |
|------------|--------|----------|------------|
| California | 423967 | 38332521 | 90.0       |
| Texas      | 695662 | 26448193 | 38.01874   |
| New York   | 141297 | 19651127 | 139.076746 |
| Florida    | 170312 | 19552860 | 114.806121 |
| Illinois   | 149995 | 12882135 | 85.883763  |


## Operating on Data in Pandas
Pandas inherits much of NumPy's ability to perform quick element-wise operations including addition, subtraction, multiplication, trigonometric functions, exponential and logarithmic functions, etc. 

>Pandas includes a couple useful twists, however: for unary operations like negation and trigonometric functions, these ufuncs will preserve index and column labels in the output, and for binary operations such as addition and multiplication, Pandas will automatically align indices when passing the objects to the ufunc.

### UFuncs: Index Preservation

In [120]:
rng = np.random.RandomState(42)
ser = pd.Series(rng.randint(0, 10, 4))
ser

0    6
1    3
2    7
3    4
dtype: int32

In [121]:
df = pd.DataFrame(rng.randint(0, 10, (3, 4)),
                  columns=['A', 'B', 'C', 'D'])
df

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 6 | 9 | 2 | 6 |
| 1 | 7 | 4 | 3 | 7 |
| 2 | 7 | 2 | 5 | 4 |


Applying a NumPy ufunc on the Pandas ``Series`` results in another Pandas object with the indices preserved.

In [123]:
np.exp(ser)

0     403.428793
1      20.085537
2    1096.633158
3      54.598150
dtype: float64

In [124]:
type(np.exp(ser))

pandas.core.series.Series

In [125]:
np.sin(df * np.pi / 4)

|   | A         | B            | C         | D            |
|---|-----------|--------------|-----------|--------------|
| 0 | -1.0      | 0.7071068    | 1.0       | -1.0         |
| 1 | -0.707107 | 1.224647e-16 | 0.707107  | -0.7071068   |
| 2 | -0.707107 | 1.0          | -0.707107 | 1.224647e-16 |


### UFuncs: Index Alignment
> For binary operations on two Series or DataFrame objects, Pandas will align indices in the process of performing the operation

In [126]:
area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
                  'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127}, name='population')

In [127]:
population / area

Alaska              NaN
California    90.413926
New York            NaN
Texas         38.018740
dtype: float64

The result contains the union of the indices of the two input ``Series``.

In [128]:
area.index.union(population.index)

Index(['Alaska', 'California', 'New York', 'Texas'], dtype='object')

Any item that does not exist in both objects is marked with ``NaN``, or "Not a Number" - missing data in Pandas. 

In [129]:
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
A + B

0    NaN
1    5.0
2    9.0
3    NaN
dtype: float64

Use the ``fill_value`` argument to specify the value to substitute for entries that are missing in either ``Series``.

In [130]:
A.add(B, fill_value=0)

0    2.0
1    5.0
2    9.0
3    5.0
dtype: float64
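
The ``fill_value`` argument is not specific to ``add``. As a brief sketch reusing the same two ``Series``, the ``sub`` and ``mul`` methods accept it as well. The fill values 0 and 1 are chosen here only because they are neutral for the respective operation.

```python
import pandas as pd

A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])

# Like A - B, with entries missing on one side treated as 0.
difference = A.sub(B, fill_value=0)

# Like A * B, with entries missing on one side treated as 1.
product = A.mul(B, fill_value=1)

print(difference)
print(product)
```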

### Index Alignment in DataFrame

In [131]:
A = pd.DataFrame(rng.randint(0, 20, (2, 2)),
                 columns=list('AB'))
A

|   | A | B  |
|---|---|----|
| 0 | 1 | 11 |
| 1 | 5 | 1  |


In [132]:
B = pd.DataFrame(rng.randint(0, 10, (3, 3)),
                 columns=list('BAC'))
B

|   | B | A | C |
|---|---|---|---|
| 0 | 4 | 0 | 9 |
| 1 | 5 | 8 | 0 |
| 2 | 9 | 2 | 6 |


In [133]:
A + B

|   | A    | B    | C   |
|---|------|------|-----|
| 0 | 1.0  | 15.0 | NaN |
| 1 | 13.0 | 6.0  | NaN |
| 2 | NaN  | NaN  | NaN |


Indices are aligned regardless of their order in the two objects, and the indices in the result are sorted.

In [134]:
fill = A.stack().mean()
A.add(B, fill_value=fill)

|   | A    | B    | C    |
|---|------|------|------|
| 0 | 1.0  | 15.0 | 13.5 |
| 1 | 13.0 | 6.0  | 4.5  |
| 2 | 6.5  | 13.5 | 10.5 |


In [135]:
A.stack()

0  A     1
   B    11
1  A     5
   B     1
dtype: int32

In [140]:
# For the reference to stack, 
# refer here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.stack.html
A.stack().mean()

4.5

In [137]:
(1+11+5+1)/4

4.5

### UFuncs: Operations Between DataFrame and Series
- Similar to operations between a two-dimensional and a one-dimensional NumPy array.
- The index and column alignment is maintained.

In [141]:
A = rng.randint(10, size=(3, 4))
A

array([[3, 8, 2, 4],
       [2, 6, 4, 8],
       [6, 1, 3, 8]])

In [142]:
A - A[0]

array([[ 0,  0,  0,  0],
       [-1, -2,  2,  4],
       [ 3, -7,  1,  4]])

> According to NumPy's broadcasting rules (see Computation on Arrays: Broadcasting), subtraction between a two-dimensional array and one of its rows is applied row-wise.

> In Pandas, the convention similarly operates row-wise by default:

In [143]:
df = pd.DataFrame(A, columns=list('QRST'))
df - df.iloc[0]

|   | Q  | R  | S | T |
|---|----|----|---|---|
| 0 | 0  | 0  | 0 | 0 |
| 1 | -1 | -2 | 2 | 4 |
| 2 | 3  | -7 | 1 | 4 |


If a column-wise operation is desired, the ``axis`` keyword can be specified.

In [144]:
df.subtract(df['R'], axis=0)

|   | Q  | R | S  | T  |
|---|----|---|----|----|
| 0 | -5 | 0 | -6 | -4 |
| 1 | -4 | 0 | -2 | 2  |
| 2 | 5  | 0 | 2  | 7  |


Indices are automatically aligned.

In [145]:
halfrow = df.iloc[0, ::2]
halfrow

Q    3
S    2
Name: 0, dtype: int32

In [148]:
df - halfrow

|   | Q    | R   | S   | T   |
|---|------|-----|-----|-----|
| 0 | 0.0  | NaN | 0.0 | NaN |
| 1 | -1.0 | NaN | 2.0 | NaN |
| 2 | 3.0  | NaN | 1.0 | NaN |
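
A common use of these alignment rules is centering data. The sketch below reuses the values of the array shown above and is an illustration only: a ``Series`` of column means aligns with the columns by default, while a ``Series`` of row means needs ``axis=0`` to align with the rows.

```python
import pandas as pd

df = pd.DataFrame([[3, 8, 2, 4],
                   [2, 6, 4, 8],
                   [6, 1, 3, 8]], columns=list('QRST'))

# df.mean() is indexed by column label, so the default alignment
# subtracts each column's mean from that column.
centered_columns = df - df.mean()

# df.mean(axis=1) is indexed by row label, so axis=0 is needed to
# align it with the rows instead of the columns.
centered_rows = df.sub(df.mean(axis=1), axis=0)

print(centered_columns)
print(centered_rows)
```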


## Handling Missing Data
> The way in which Pandas handles missing values is constrained by its reliance on the NumPy package, which does not have a built-in notion of NA values for non-floating-point data types.

> ... Pandas chose to use sentinels for missing data, and further chose to use two already-existing Python null values: the special floating-point NaN value, and the Python None object.

### NaN and None in Pandas
Pandas handles NaN and None interchangeably. 

In [149]:
pd.Series([1, np.nan, 2, None])

0    1.0
1    NaN
2    2.0
3    NaN
dtype: float64

Pandas automatically type-casts when NA values are present.
A value of ``None`` is converted to ``NaN``.

In [150]:
x = pd.Series(range(2), dtype=int)
x

0    0
1    1
dtype: int32

In [151]:
x[0] = None
x

0    NaN
1    1.0
dtype: float64

The following table lists the upcasting conventions in Pandas when NA values are introduced:

|Typeclass     | Conversion When Storing NAs | NA Sentinel Value      |
|--------------|-----------------------------|------------------------|
| ``floating`` | No change                   | ``np.nan``             |
| ``object``   | No change                   | ``None`` or ``np.nan`` |
| ``integer``  | Cast to ``float64``         | ``np.nan``             |
| ``boolean``  | Cast to ``object``          | ``None`` or ``np.nan`` |

Keep in mind that in the Pandas versions used here, string data is stored with an ``object`` dtype by default.
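
The upcasting rules in the table can be checked directly. This is a small sketch with made-up values that relies on the default (NumPy-backed) dtypes that Pandas infers from Python lists.

```python
import pandas as pd

ints = pd.Series([1, 2, None])          # integers + NA -> float64
floats = pd.Series([1.5, None])         # floats + NA   -> stays float64
bools = pd.Series([True, False, None])  # booleans + NA -> object

print(ints.dtype, floats.dtype, bools.dtype)
```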

### Useful methods for detecting, removing and replacing null values:
- ``isnull()``: Generate a boolean mask indicating missing values
- ``notnull()``: Opposite of ``isnull()``
- ``dropna()``: Return a filtered version of the data
- ``fillna()``: Return a copy of the data with missing values filled or imputed

In [152]:
data = pd.Series([1, np.nan, 'hello', None])

In [153]:
data.isnull()

0    False
1     True
2    False
3     True
dtype: bool

#### Using Boolean Masks as ``Series`` or ``DataFrame`` index

In [154]:
data[data.notnull()]

0        1
2    hello
dtype: object

### Dropping null values

In [155]:
data.dropna()

0        1
2    hello
dtype: object

In [156]:
df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
df

|   | 0   | 1   | 2 |
|---|-----|-----|---|
| 0 | 1.0 | NaN | 2 |
| 1 | 2.0 | 3.0 | 5 |
| 2 | NaN | 4.0 | 6 |


>We cannot drop single values from a DataFrame; we can only drop full rows or full columns. Depending on the application, you might want one or the other, so dropna() gives a number of options for a DataFrame.

>By default, dropna() will drop all rows in which any null value is present:

In [157]:
df.dropna()

|   | 0   | 1   | 2 |
|---|-----|-----|---|
| 1 | 2.0 | 3.0 | 5 |


In [160]:
df.dropna(axis='columns')

|   | 2 |
|---|---|
| 0 | 2 |
| 1 | 5 |
| 2 | 6 |
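
Among the options mentioned in the quote above are the ``how`` and ``thresh`` parameters of ``dropna()``. The following sketch extends the example with an extra all-null column (an assumption made only for illustration) to show both.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1,      np.nan, 2, np.nan],
                   [2,      3,      5, np.nan],
                   [np.nan, 4,      6, np.nan]])

# how='all' drops only the columns in which every value is null.
no_empty_columns = df.dropna(axis='columns', how='all')

# thresh=3 keeps only the rows with at least three non-null values.
well_filled_rows = df.dropna(axis='index', thresh=3)

print(no_empty_columns)
print(well_filled_rows)
```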


### Filling null values

In [161]:
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
data

a    1.0
b    NaN
c    2.0
d    NaN
e    3.0
dtype: float64

In [162]:
data.fillna(0)

a    1.0
b    0.0
c    2.0
d    0.0
e    3.0
dtype: float64

Forward-fill to propagate the previous value forward:

In [165]:
data.ffill()

a    1.0
b    1.0
c    2.0
d    2.0
e    3.0
dtype: float64

Back-fill to propagate the next values backward:

In [167]:
data.bfill()

a    1.0
b    2.0
c    2.0
d    3.0
e    3.0
dtype: float64

In [168]:
df

|   | 0   | 1   | 2 |
|---|-----|-----|---|
| 0 | 1.0 | NaN | 2 |
| 1 | 2.0 | 3.0 | 5 |
| 2 | NaN | 4.0 | 6 |


In [169]:
df.ffill(axis=1)

|   | 0   | 1   | 2   |
|---|-----|-----|-----|
| 0 | 1.0 | 1.0 | 2.0 |
| 1 | 2.0 | 3.0 | 5.0 |
| 2 | NaN | 4.0 | 6.0 |
