In [1]:
import numpy as np
import pandas as pd

In [2]:
# Helper class to display DataFrames side by side

class display(object):
    """Display HTML representation of multiple objects"""
    template = """<div style="float: left; padding: 10px;">
    <p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
    </div>"""
    def __init__(self, *args):
        self.args = args
        
    def _repr_html_(self):
        return '\n'.join(self.template.format(a, eval(a)._repr_html_())
                         for a in self.args)
    
    def __repr__(self):
        return '\n\n'.join(a + '\n' + repr(eval(a))
                           for a in self.args)


----


## Index Alignment

For any operations on two ``Series`` or ``DataFrame`` objects, Pandas will align indices in the process of performing the operation.

The resulting array contains **the *union* of indices of the two input arrays**. Any item for which one or the other does not have an entry is marked with ``NaN``, or "Not a Number," which is how Pandas marks missing data. 

This index matching is implemented this way for any of Python's built-in arithmetic expressions; any missing values are filled in with NaN by default.

In [3]:
A = pd.Series([2, 4, 6, 8, 10], index=[0, 1, 2, 3, 4])
B = pd.Series([10, 12, 13], index=[1, 2, 4])
A + B

0     NaN
1    14.0
2    18.0
3     NaN
4    23.0
dtype: float64

If using NaN values is not the desired behavior, the fill value can be modified using appropriate object methods in place of the operators.
For example, calling ``A.add(B)`` is equivalent to calling ``A + B``, but allows optional explicit specification of the fill value for any elements in ``A`` or ``B`` that might be missing.


In [4]:
A.add(B, fill_value=0)

0     2.0
1    14.0
2    18.0
3     8.0
4    23.0
dtype: float64

In [5]:
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                   index=['Ohio', 'Texas', 'Colorado'])

df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                   index=['Ohio', 'Texas', 'Oregon', 'Utah'])
display('df1','df2','df1+df2')

Unnamed: 0,b,c,d
Ohio,0.0,1.0,2.0
Texas,3.0,4.0,5.0
Colorado,6.0,7.0,8.0

Unnamed: 0,b,d,e
Ohio,0.0,1.0,2.0
Texas,3.0,4.0,5.0
Oregon,6.0,7.0,8.0
Utah,9.0,10.0,11.0

Unnamed: 0,b,c,d,e
Colorado,,,,
Ohio,0.0,,3.0,
Oregon,,,,
Texas,6.0,,9.0,
Utah,,,,


In [6]:
display('df1','df2','df1.add(df2,fill_value=0)')

Unnamed: 0,b,c,d
Ohio,0.0,1.0,2.0
Texas,3.0,4.0,5.0
Colorado,6.0,7.0,8.0

Unnamed: 0,b,d,e
Ohio,0.0,1.0,2.0
Texas,3.0,4.0,5.0
Oregon,6.0,7.0,8.0
Utah,9.0,10.0,11.0

Unnamed: 0,b,c,d,e
Colorado,6.0,7.0,8.0,
Ohio,0.0,1.0,3.0,2.0
Oregon,6.0,,7.0,8.0
Texas,6.0,4.0,9.0,5.0
Utah,9.0,,10.0,11.0
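
Note that Colorado's ``e`` entry is still missing in the output above: ``fill_value`` stands in only where exactly one of the two operands lacks a value. The following is a minimal sketch of that behavior, using two small hypothetical frames (``left`` and ``right`` are illustrative names, not part of the example above):

```python
import numpy as np
import pandas as pd

# Hypothetical frames that share only the row label 'Ohio' and no columns
left = pd.DataFrame({"b": [1.0, 2.0]}, index=["Ohio", "Texas"])
right = pd.DataFrame({"e": [10.0, 20.0]}, index=["Ohio", "Utah"])

result = left.add(right, fill_value=0)

# One side present: the missing side is treated as 0
print(result.loc["Ohio", "b"])             # 1.0
# Both sides missing after alignment: the result stays NaN
print(np.isnan(result.loc["Texas", "e"]))  # True
```

If a fully populated result is needed, one option is to call ``fillna`` on the result afterwards.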


##### Flexible arithmetic methods provided by Pandas

| Method | Description |
| :--- | :--- |
| add, radd | Methods for addition (+) |
| sub, rsub | Methods for subtraction (-) |
| div, rdiv | Methods for division (/) |
| floordiv, rfloordiv | Methods for floor division (//) |
| mul, rmul | Methods for multiplication (*) |
| pow, rpow | Methods for exponentiation (**) |
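
Each method in the table has a counterpart starting with ``r`` that swaps the operands. As a small illustrative sketch (the Series ``s`` is a made-up example, not data from this notebook):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 4.0])

# s.rdiv(1) computes 1 / s, whereas s.div(1) would compute s / 1
reciprocal = s.rdiv(1)
print(reciprocal.tolist())   # [1.0, 0.5, 0.25]

# s.rsub(10) computes 10 - s
print(s.rsub(10).tolist())   # [9.0, 8.0, 6.0]
```

The reversed forms are mostly useful when a ``fill_value`` or ``axis`` argument is needed and the pandas object sits on the right-hand side of the operator.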

In [7]:
display('df1','df2','df1.sub(df2, fill_value=0)')

Unnamed: 0,b,c,d
Ohio,0.0,1.0,2.0
Texas,3.0,4.0,5.0
Colorado,6.0,7.0,8.0

Unnamed: 0,b,d,e
Ohio,0.0,1.0,2.0
Texas,3.0,4.0,5.0
Oregon,6.0,7.0,8.0
Utah,9.0,10.0,11.0

Unnamed: 0,b,c,d,e
Colorado,6.0,7.0,8.0,
Ohio,0.0,1.0,1.0,-2.0
Oregon,-6.0,,-7.0,-8.0
Texas,0.0,4.0,1.0,-5.0
Utah,-9.0,,-10.0,-11.0


In [8]:
display('df1','df2','df1.rsub(df2, fill_value=0)')

Unnamed: 0,b,c,d
Ohio,0.0,1.0,2.0
Texas,3.0,4.0,5.0
Colorado,6.0,7.0,8.0

Unnamed: 0,b,d,e
Ohio,0.0,1.0,2.0
Texas,3.0,4.0,5.0
Oregon,6.0,7.0,8.0
Utah,9.0,10.0,11.0

Unnamed: 0,b,c,d,e
Colorado,-6.0,-7.0,-8.0,
Ohio,0.0,-1.0,-1.0,2.0
Oregon,6.0,,7.0,8.0
Texas,0.0,-4.0,-1.0,5.0
Utah,9.0,,10.0,11.0


A similar type of alignment takes place for *both* columns and indices when performing operations on ``DataFrame`` objects:

In [9]:
A = pd.Series([4, 8], index=['A', 'C'])
B = pd.Series([14, 18], index=['A', 'C'])
df_A = pd.DataFrame([A, B])
print(df_A)

    A   C
0   4   8
1  14  18


In [10]:
A = pd.Series([2, 4, 6], index=['B','A','C'])
B = pd.Series([10, 12, 14], index=['B','A','C'])
C = pd.Series([21, 23, 25], index=['B','A','C'])
df_B = pd.DataFrame([A, B, C])
print(df_B)

    B   A   C
0   2   4   6
1  10  12  14
2  21  23  25


In [11]:
df_A + df_B

Unnamed: 0,A,B,C
0,8.0,,14.0
1,26.0,,32.0
2,,,


Notice that indices are aligned correctly irrespective of their order in the two objects, and indices in the result are sorted.

As was the case with ``Series``, we can use the associated object's arithmetic method and pass any desired ``fill_value`` to be used in place of missing entries.

Here we'll fill with a constant value of 100, so any entry present in only one of the two frames is combined with 100:

In [12]:
fill = 100
df_A.add(df_B, fill_value=fill)

Unnamed: 0,A,B,C
0,8.0,102.0,14.0
1,26.0,110.0,32.0
2,123.0,121.0,125.0


## Operations Between DataFrame and Series

When performing operations between a ``DataFrame`` and a ``Series``, the index and column alignment is similarly maintained.
Operations between a ``DataFrame`` and a ``Series`` are similar to operations between a two-dimensional and one-dimensional NumPy array.
Consider one common operation, where we find the difference of a two-dimensional array and one of its rows:

In [13]:
A = np.random.randint(10, size=(3, 4))
A

array([[3, 1, 7, 1],
       [9, 5, 3, 2],
       [9, 9, 5, 2]])

In [14]:
A[0]

array([3, 1, 7, 1])

In [15]:
A-A[0]

array([[ 0,  0,  0,  0],
       [ 6,  4, -4,  1],
       [ 6,  8, -2,  1]])

According to NumPy's broadcasting rules, subtraction between a two-dimensional array and one of its rows is applied row-wise.

In Pandas, the convention similarly operates row-wise by default:

In [16]:
A = pd.Series([2, 4, 6], index=['B','A','C'])
B = pd.Series([10, 12, 14], index=['B','A','C'])
C = pd.Series([21, 23, 25], index=['B','A','C'])
df_A = pd.DataFrame([A, B, C])
print(df_A)

    B   A   C
0   2   4   6
1  10  12  14
2  21  23  25


In [17]:
df_A.iloc[0]

B    2
A    4
C    6
Name: 0, dtype: int64

In [18]:
df_A - df_A.iloc[0]

Unnamed: 0,B,A,C
0,0,0,0
1,8,8,8
2,19,19,19


In [19]:
df_A.subtract(df_A.iloc[0])

Unnamed: 0,B,A,C
0,0,0,0
1,8,8,8
2,19,19,19


In [20]:
df_A.subtract(df_A.iloc[0], axis=1)

Unnamed: 0,B,A,C
0,0,0,0
1,8,8,8
2,19,19,19


In [21]:
df_A.subtract(df_A['B'], axis=0)


Unnamed: 0,B,A,C
0,0,2,4
1,0,2,4
2,0,2,4


In [22]:
df_A.sub(df_A['B'], axis=0)

Unnamed: 0,B,A,C
0,0,2,4
1,0,2,4
2,0,2,4
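
The ``axis`` argument in the calls above selects which labels the ``Series`` is matched against: the column labels for ``axis=1`` (or ``'columns'``), the row labels for ``axis=0`` (or ``'index'``). The sketch below rebuilds the same values in a hypothetical frame ``df`` and passes a ``Series`` whose labels are deliberately out of order, to show that matching is by label rather than by position:

```python
import pandas as pd

df = pd.DataFrame({"B": [2, 10, 21], "A": [4, 12, 23], "C": [6, 14, 25]})

# A Series whose labels are column names, in a different order from df
first_row = pd.Series({"A": 4, "C": 6, "B": 2})

# axis='columns' matches the Series labels against the column labels,
# so 4 is subtracted from column A, 2 from column B, and 6 from column C
by_columns = df.sub(first_row, axis="columns")
print(by_columns)

# axis='index' matches the Series labels against the row labels,
# so each row has its own 'B' value subtracted from every column
by_rows = df.sub(df["B"], axis="index")
print(by_rows)
```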


---


## Dropping Entries from an Axis

In [23]:
obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
obj
new_obj = obj.drop('c')
new_obj
obj.drop(['d', 'c'])

a    0.0
b    1.0
e    4.0
dtype: float64

In [24]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [25]:
data.drop(['Colorado', 'Ohio'])

Unnamed: 0,one,two,three,four
Utah,8,9,10,11
New York,12,13,14,15


In [26]:
data.drop('two', axis=1)
data.drop(['two', 'four'], axis='columns')

Unnamed: 0,one,three
Ohio,0,2
Colorado,4,6
Utah,8,10
New York,12,14


In [27]:
obj.drop('c', inplace=True)
obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64
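
As the cells above suggest, ``drop`` returns a new object unless ``inplace=True`` is passed. A short sketch of that distinction follows; the variable names ``s``, ``trimmed``, ``via_keyword`` and ``via_axis`` are illustrative only:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5.), index=["a", "b", "c", "d", "e"])
trimmed = s.drop("c")          # returns a new Series
print("c" in s.index)          # True: the original is unchanged
print("c" in trimmed.index)    # False

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

# columns=... is an alternative spelling of axis='columns'
via_keyword = data.drop(columns=["two", "four"])
via_axis = data.drop(["two", "four"], axis="columns")
print(via_keyword.equals(via_axis))  # True
```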

## Sorting and Ranking

Sorting a dataset by some criterion is another important built-in operation. To sort lexicographically by row or column index, use the sort_index method, **which returns a new, sorted object**.

In [28]:
# An unsorted Series to work with
obj = pd.Series([10, 65, 15, 75, 30, 50], index=['d', 'a', 'e', 'b','c','f'])
obj

d    10
a    65
e    15
b    75
c    30
f    50
dtype: int64

To sort a Series by its Index, use its sort_index method:

In [29]:
obj = obj.sort_index()
obj

a    65
b    75
c    30
d    10
e    15
f    50
dtype: int64

To sort a Series by its values, use its sort_values method:

In [30]:
obj.sort_values()

d    10
e    15
c    30
f    50
a    65
b    75
dtype: int64

Any missing values are sorted to the end of the Series by default:

In [31]:
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
obj.sort_values()

4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64
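
The placement of missing values does not depend on the sort direction, and it can be changed with the ``na_position`` argument of ``sort_values``. A minimal sketch using the same values as above (``descending`` and ``nan_first`` are illustrative names):

```python
import numpy as np
import pandas as pd

obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])

# Descending sort: the NaN entries still come last
descending = obj.sort_values(ascending=False)
print(descending.index.tolist()[:4])   # [2, 0, 5, 4]

# na_position='first' moves the NaN entries to the front instead
nan_first = obj.sort_values(na_position="first")
print(nan_first.index.tolist()[2:])    # [4, 5, 0, 2]
```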

With a DataFrame, you can sort by index on either axis:

In [32]:
frame = pd.DataFrame(data=[[10, 65, 15, 75],
                           [30, 50, 20, 90]],
                     index=['three', 'one'],
                     columns=['sld', 'za', 'Jb', 'gc'])
display('frame','frame.sort_index()','frame.sort_index(axis=1)')

Unnamed: 0,sld,za,Jb,gc
three,10,65,15,75
one,30,50,20,90

Unnamed: 0,sld,za,Jb,gc
one,30,50,20,90
three,10,65,15,75

Unnamed: 0,Jb,gc,sld,za
three,15,75,10,65
one,20,90,30,50


When sorting a DataFrame, you can use the data in one or more columns as the sort keys. To do so, pass one or more column names to the by option of sort_values:

In [33]:
display('frame.sort_index(axis=1).sort_values(by="gc")',
        'frame.sort_index(axis=1).sort_values(by=["gc","za"])')

Unnamed: 0,Jb,gc,sld,za
three,15,75,10,65
one,20,90,30,50

Unnamed: 0,Jb,gc,sld,za
three,15,75,10,65
one,20,90,30,50
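
With only two rows and no repeated values in ``gc``, the second sort key above never comes into play. The sketch below uses a hypothetical frame ``scores`` with repeated values in its first key, so the effect of a secondary key (and of a per-key ``ascending`` list) is visible:

```python
import pandas as pd

# Hypothetical frame in which column 'a' contains repeated values
scores = pd.DataFrame({"a": [0, 1, 0, 1], "b": [4, 7, -3, 2]})

# Sort by 'a' first; rows with equal 'a' are then ordered by 'b'
by_a_then_b = scores.sort_values(by=["a", "b"])
print(by_a_then_b.index.tolist())   # [2, 0, 3, 1]

# A list for ascending sets the direction per key
mixed = scores.sort_values(by=["a", "b"], ascending=[True, False])
print(mixed.index.tolist())         # [0, 2, 1, 3]
```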


Ranking assigns ranks from one through the number of valid data points in an array. Both Series and DataFrame provide a rank method; by default, rank breaks ties by assigning each group the mean rank. The Series below has no repeated values, so each element simply receives its position in sorted order:

In [34]:
obj = pd.Series([10, 65, 15, 75, 30, 50], index=['d', 'a', 'e', 'b','c','f'])
obj.rank()

d    1.0
a    5.0
e    2.0
b    6.0
c    3.0
f    4.0
dtype: float64
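
Because the Series above contains no repeated values, the default tie-breaking is not visible in its output. A minimal sketch with one tie, using a hypothetical Series ``tied`` in which the value 15 appears twice:

```python
import pandas as pd

# Same labels as above, but the value 15 now appears twice (labels e and c)
tied = pd.Series([10, 65, 15, 75, 15, 50],
                 index=["d", "a", "e", "b", "c", "f"])

ranks = tied.rank()   # method='average' is the default
print(ranks["e"], ranks["c"])   # 2.5 2.5
```

The two tied entries would occupy ranks 2 and 3, so each receives their mean, 2.5.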

Ranks can also be assigned according to the order in which the values are observed in the data. In the Series below the value 15 appears twice, under labels e and c; instead of sharing the average rank 2.5, the two entries are ranked 2 and 3 because label e precedes label c in the data.

In [35]:
obj = pd.Series([10, 65, 15, 75, 15, 50], index=['d', 'a', 'e', 'b','c','f'])
obj.rank(method='first')

d    1.0
a    5.0
e    2.0
b    6.0
c    3.0
f    4.0
dtype: float64

You can rank in descending order, too:

In [36]:
obj.rank(ascending=False, method='max')

d    6.0
a    2.0
e    5.0
b    1.0
c    5.0
f    3.0
dtype: float64

##### Tie-breaking methods with rank

|Method | Description |
| :--- | :--- |
|'average' | Default: assign the average rank to each entry in the equal group |
|'min' | Use the minimum rank for the whole group |
|'max' | Use the maximum rank for the whole group |
|'first' | Assign ranks in the order the values appear in the data |
|'dense' | Like method='min', but ranks always increase by 1 in between groups rather than the number of equal elements in a group |
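
The methods in the table can be compared side by side. The sketch below builds one column of ranks per method for a hypothetical Series ``tied`` with a single tie (the value 15 under labels e and c); the name ``comparison`` is illustrative only:

```python
import pandas as pd

tied = pd.Series([10, 65, 15, 75, 15, 50],
                 index=["d", "a", "e", "b", "c", "f"])

# One column of ranks per tie-breaking method
methods = ["average", "min", "max", "first", "dense"]
comparison = pd.DataFrame({m: tied.rank(method=m) for m in methods})
print(comparison)
```

In the result, rows e and c differ only under ``'first'``, and the ``'dense'`` column tops out at 5 rather than 6 because the tied group consumes a single rank.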

## Function Application and Mapping

Another frequent operation is applying a function on one-dimensional arrays to each column or row. DataFrame’s apply method does exactly this. Here the function f, which computes the difference between the maximum and minimum of a Series, is invoked once on each column in frame. The result is a Series having the columns of frame as its index.

In [50]:
f = lambda x: x.max() - x.min()

def f_series(x):
    return x*2

In [51]:
# f_series multiplies each element of the Series by 2
obj.apply(f_series)

d     20
a    130
e     30
b    150
c     30
f    100
dtype: int64

In [43]:
print(frame)

       sld  za  Jb  gc
three   10  65  15  75
one     30  50  20  90


In [45]:
# Apply f to each column of the DataFrame (axis=0, the default)
# The output is max - min for each column

print(frame.apply(f))

sld    20
za     15
Jb      5
gc     15
dtype: int64


In [56]:
# The axis argument controls the direction; the default is axis=0 (one call per column)
# axis=1 (or axis='columns') applies the function once per row

frame.apply(f, axis=1)

three    65
one      70
dtype: int64
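
Two related points, sketched under the assumption that ``frame`` holds the same values as above: a reduction like this one can also be written with the built-in ``max`` and ``min`` methods, and the function passed to ``apply`` may return a ``Series`` instead of a scalar, in which case the result is a ``DataFrame``. The helper ``min_max`` is a hypothetical name introduced for illustration:

```python
import pandas as pd

frame = pd.DataFrame([[10, 65, 15, 75], [30, 50, 20, 90]],
                     index=["three", "one"],
                     columns=["sld", "za", "Jb", "gc"])

# The same per-column range computed with built-in reductions
value_range = frame.max() - frame.min()
print(value_range)

# Hypothetical helper returning several values per column
def min_max(x):
    return pd.Series([x.min(), x.max()], index=["min", "max"])

# apply collects the returned Series into a DataFrame
# with rows 'min' and 'max' and one column per column of frame
summary = frame.apply(min_max)
print(summary)
```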

***

## Axis Indexes with Duplicate Labels

Up until now all of the examples we’ve looked at have had unique axis labels (index values). While many pandas functions (like reindex) require that the labels be unique, it’s not mandatory. Let’s consider a small Series with duplicate indices:

In [58]:
obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj

a    0
a    1
b    2
b    3
c    4
dtype: int64

The index’s is_unique property can tell you whether its labels are unique or not:

In [61]:
obj.index.is_unique

False

Data selection is one of the main things that behaves differently with duplicates. Indexing a label with multiple entries returns a Series, while single entries return a scalar value:

In [62]:
obj['a']

a    0
a    1
dtype: int64

In [63]:
obj['c']

4

This can make your code more complicated, as the output type from indexing can vary based on whether a label is repeated or not. The same logic extends to indexing rows in a DataFrame:

In [65]:
df = pd.DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])
df

Unnamed: 0,0,1,2
a,1.069118,0.268167,-1.829441
a,0.560966,1.577026,0.502531
b,-0.578913,-0.605591,-0.328276
b,-1.100376,0.457074,-0.308125


In [66]:
df.loc['b']

Unnamed: 0,0,1,2
b,-0.578913,-0.605591,-0.328276
b,-1.100376,0.457074,-0.308125
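
One way to keep the return type predictable, offered here as a sketch rather than a required practice, is to pass a list of labels: list-based selection returns a ``Series`` (or ``DataFrame``) even when the label occurs only once.

```python
import pandas as pd

obj = pd.Series(range(5), index=["a", "a", "b", "b", "c"])

# A bare label gives a Series or a scalar depending on duplication
print(isinstance(obj["a"], pd.Series))   # True
print(isinstance(obj["c"], pd.Series))   # False

# A list of labels always gives a Series, even for a unique label
as_series = obj.loc[["c"]]
print(len(as_series))                    # 1
```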


## Summarizing and Computing Descriptive Statistics

Pandas objects are equipped with a set of common mathematical and statistical methods. Most of these fall into the category of reductions or summary statistics, methods that extract a single value (like the sum or mean) from a Series or a Series of values from the rows or columns of a DataFrame. Compared with the similar methods found on NumPy arrays, they have built-in handling for missing data. Consider a small DataFrame:

In [67]:
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])
df

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3


Calling DataFrame’s sum method returns a Series containing column sums. NaN values are ignored. When no axis argument is given, the sum is computed for each column:

In [72]:
df.sum()

one    9.25
two   -5.80
dtype: float64

In [73]:
# Passing axis='columns' or axis=1 sums across the columns instead

df.sum(axis='columns')

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

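By default NA values are skipped. For a slice (row or column in this case) that is entirely NA, sum returns 0 while mean returns NaN. Skipping can be disabled with the skipna option, in which case any NA in the slice makes the result NA: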

In [74]:
df.mean(axis='columns', skipna=False)

a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64
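
The all-NaN row ``c`` is worth a closer look. The sketch below rebuilds the same frame and contrasts the default ``sum`` with the ``min_count`` argument, which asks for a minimum number of valid values before a result is reported (the names ``default_sum`` and ``strict_sum`` are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=["a", "b", "c", "d"],
                  columns=["one", "two"])

# Default: the all-NaN row 'c' sums to 0.0
default_sum = df.sum(axis="columns")
print(default_sum["c"])                          # 0.0

# min_count=1 requires at least one valid value, otherwise the result is NaN
strict_sum = df.sum(axis="columns", min_count=1)
print(np.isnan(strict_sum["c"]))                 # True

# mean has nothing to average in row 'c', so it is NaN even with skipna=True
print(np.isnan(df.mean(axis="columns")["c"]))    # True
```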

##### Options for reduction methods
| Argument | Description |
| :--- | :--- |
| axis | Axis to reduce over; 0 for DataFrame’s rows and 1 for columns |
| skipna | Exclude missing values; True by default |
| level | Reduce grouped by level if the axis is hierarchically indexed (MultiIndex); removed in pandas 2.0, where groupby(level=...) is used instead |

Some methods, like idxmin and idxmax, return indirect statistics such as the index label at which the minimum or maximum value is attained:

In [76]:
df.idxmax()

one    b
two    d
dtype: object

Other methods, such as cumsum, are accumulations. The following table lists common descriptive and summary statistics methods:

|Method | Description |
| :--- | :--- |
|count | Number of non-NA values |
|describe | Compute set of summary statistics for Series or each DataFrame column |
|min, max | Compute minimum and maximum values |
|argmin, argmax | Compute index locations (integers) at which minimum or maximum value obtained, respectively |
|idxmin, idxmax | Compute index labels at which minimum or maximum value obtained, respectively |
|quantile | Compute sample quantile ranging from 0 to 1 |
|sum | Sum of values |
|mean | Mean of values |
|median | Arithmetic median (50% quantile) of values |
|mad | Mean absolute deviation from mean value (removed in pandas 2.0) |
|prod | Product of all values |
|var | Sample variance of values |
|std | Sample standard deviation of values |
|skew | Sample skewness (third moment) of values |
|kurt | Sample kurtosis (fourth moment) of values |
|cumsum | Cumulative sum of values |
|cummin, cummax | Cumulative minimum or maximum of values, respectively |
|cumprod | Cumulative product of values |
|diff | Compute first arithmetic difference (useful for time series) |
|pct_change | Compute percent changes |
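
As a brief sketch of two entries from the table, the code below rebuilds the small frame used earlier in this section and calls ``cumsum`` (an accumulation) and ``describe`` (several summary statistics at once); ``running`` and ``stats`` are illustrative names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=["a", "b", "c", "d"],
                  columns=["one", "two"])

# Accumulation: a running total down each column;
# NaN positions stay NaN while the total carries on past them
running = df.cumsum()
print(running)

# describe reports count, mean, std, min, quartiles and max per column
stats = df.describe()
print(stats.loc["count"])
```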

***