# Arithmetic

In [1]:
import pandas as pd
import numpy as np

## Arithmetic and Data Alignment

**Adding** objects aligns them on their index labels, much like an automatic outer join on the index in a database.  
In the case of **Series**:

In [4]:
s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([1, 3, 5, 6, 7], index=['a', 'c', 'e', 'd', 'f'])

In [5]:
s1

a    1
b    2
c    3
d    4
dtype: int64

In [6]:
s2

a    1
c    3
e    5
d    6
f    7
dtype: int64

In [8]:
s1 + s2

a     2.0
b     NaN
c     6.0
d    10.0
e     NaN
f     NaN
dtype: float64


Alignment introduces **missing values (NaN)** at the labels that don't overlap, and these propagate through further arithmetic
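
A minimal sketch of that propagation, reusing the values of `s1` and `s2` from above. The names `total` and `doubled` are illustrative only.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([1, 3, 5, 6, 7], index=['a', 'c', 'e', 'd', 'f'])

total = s1 + s2      # the result index is the union of both indexes
doubled = total * 2  # NaN stays NaN in later arithmetic

print(total.index.tolist())       # ['a', 'b', 'c', 'd', 'e', 'f']
print(doubled.isna().sum())       # 3 (labels 'b', 'e' and 'f' are still missing)
print(doubled.dropna().tolist())  # [4.0, 12.0, 20.0]
```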

In the case of **DataFrame**, alignment works the same way as in Series, on both the rows and the columns

In [2]:
df1 = pd.DataFrame({'A':[1,2]})
df2 = pd.DataFrame({'B':[3,4]})

In [15]:
df1

   A
0  1
1  2


In [16]:
df2

   B
0  3
1  4


In [17]:
df1 + df2

    A   B
0 NaN NaN
1 NaN NaN


If there are no common labels, the result will contain **all nulls**

## Arithmetic methods with *fill values*

In [9]:
df1.add(df2, fill_value=0)

     A    B
0  1.0  3.0
1  2.0  4.0


Flexible arithmetic *methods*:
>add, radd +  
>sub, rsub -  
>div, rdiv /  
>floordiv, rfloordiv //  
>mul, rmul *  
>pow, rpow **
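
As a small sketch of how the `r`-prefixed variants relate to the operators, the example below uses made-up frames and series. The names `left`, `right` and `filled` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0], 'B': [4.0, 8.0]})

# The r-prefixed methods swap the operands: df.rdiv(1) computes 1 / df
print(df.rdiv(1).equals(1 / df))    # True
print(df.rsub(10).equals(10 - df))  # True

# fill_value stands in for a value missing on one side before the operation
left = pd.Series([1.0, 2.0], index=['a', 'b'])
right = pd.Series([10.0, 20.0], index=['b', 'c'])
filled = left.sub(right, fill_value=0)
print(filled.tolist())  # [1.0, -8.0, -20.0]
```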

## Function Application and Mapping

In [11]:
frame = pd.DataFrame(np.random.randn(4,3), columns=list('abc'))
frame

          a         b         c
0 -0.511194 -0.752992 -1.506115
1 -0.349879  1.696236 -0.190295
2 -0.079399  0.332442  0.650247
3  1.043283  0.439715 -2.295419


**NumPy** ufuncs (element-wise array functions) work with pandas objects

In [13]:
np.abs(frame)

          a         b         c
0  0.511194  0.752992  1.506115
1  0.349879  1.696236  0.190295
2  0.079399  0.332442  0.650247
3  1.043283  0.439715  2.295419



**Lambda functions** also work with pandas: *apply* calls the function once per column by default

In [16]:
func1 = lambda x: x.max() - x.min()
frame.apply(func1)

a    1.554476
b    2.449229
c    2.945666
dtype: float64

The function is called once per row instead if you pass *axis=1* or *axis='columns'*

In [17]:
frame.apply(func1, axis=1)

0    0.994921
1    2.046115
2    0.729646
3    3.338702
dtype: float64
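
The function passed to *apply* does not have to return a scalar. As a sketch on a small made-up frame (the helper name `min_max` is hypothetical), returning a Series produces a DataFrame:

```python
import pandas as pd

frame = pd.DataFrame({'a': [1.0, 4.0, 2.0], 'b': [10.0, 30.0, 20.0]})

def min_max(x):
    # x is one column (default) or one row (axis=1) of the frame
    return pd.Series([x.min(), x.max()], index=['min', 'max'])

per_column = frame.apply(min_max)       # rows 'min'/'max', columns 'a'/'b'
per_row = frame.apply(min_max, axis=1)  # one row per original row, columns 'min'/'max'

print(per_column)
print(per_row)
```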

**Element-wise Python functions** can be used, too.

In [21]:
format = lambda x: '%.2f' % x
frame.applymap(format)
#In pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map

       a      b      c
0  -0.51  -0.75  -1.51
1  -0.35   1.70  -0.19
2  -0.08   0.33   0.65
3   1.04   0.44  -2.30


For Series, use *.map(format)*
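
A short sketch of the Series case on made-up values. The formatter is named `fmt` here only to avoid shadowing Python's built-in `format`.

```python
import pandas as pd

s = pd.Series([1.0, -2.5, 3.14159])
fmt = lambda x: '%.2f' % x  # format each element with two decimals

formatted = s.map(fmt)      # element-wise; the result holds strings
print(formatted.tolist())   # ['1.00', '-2.50', '3.14']
```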

## Sorting and Ranking

### Sorting  
To sort **lexicographically** by index label, use the *sort_index* method

In [26]:
obj = pd.Series([4,7,-3,2], index=list('dabc'))
obj.sort_index()

a    7
b   -3
c    2
d    4
dtype: int64

Sorting is **ascending** by default; pass *ascending=False* to reverse it

In [27]:
obj.sort_index(ascending=False)

d    4
c    2
b   -3
a    7
dtype: int64

To sort by **values**, use the *sort_values* method

In [29]:
obj.sort_values()
#NaN values are sorted to the end by default

b   -3
c    2
d    4
a    7
dtype: int64

For a DataFrame, use the *by* option of sort_values to name the column or columns used as **sort keys**

In [33]:
frame.sort_values(by = 'c')

          a         b         c
3  1.043283  0.439715 -2.295419
0 -0.511194 -0.752992 -1.506115
1 -0.349879  1.696236 -0.190295
2 -0.079399  0.332442  0.650247
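
A sketch of two related cases on made-up data: sorting by several keys by passing a list to *by*, and where NaN values end up. The `na_position` argument is part of `sort_values`; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 0, 1], 'b': [4, 7, -3, 2]})
by_two = df.sort_values(by=['a', 'b'])  # sort by 'a', break ties with 'b'
print(by_two.index.tolist())            # [2, 0, 3, 1]

s = pd.Series([4, np.nan, 7, np.nan, -3, 2])
nan_last = s.sort_values()                      # NaN rows go to the end
nan_first = s.sort_values(na_position='first')  # or to the front on request
print(nan_last.index.tolist()[:4])              # [4, 5, 0, 2]
```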


### Ranking

By default, *rank* breaks ties by assigning each tied group the **mean** rank

In [35]:
obj = pd.Series([7,-5,7,4,2,0,4])
obj.rank()

0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

In [36]:
obj.rank(method = 'first')
#ties are ranked in the order in which they're observed in the data

0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64

In [37]:
obj.rank(method = 'max')
#assign tied values the maximum rank in the group

0    7.0
1    1.0
2    7.0
3    5.0
4    3.0
5    2.0
6    5.0
dtype: float64

>Methods in *rank()*: average, min, max, first, dense
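
For comparison, a sketch of the remaining tie-breaking methods on the same values as above, plus a descending rank:

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# 'min' gives every tied value the lowest rank in its group
print(obj.rank(method='min').tolist())    # [6.0, 1.0, 6.0, 4.0, 3.0, 2.0, 4.0]

# 'dense' is like 'min', but ranks increase by 1 between groups
print(obj.rank(method='dense').tolist())  # [5.0, 1.0, 5.0, 4.0, 3.0, 2.0, 4.0]

# ascending=False ranks the largest value first
print(obj.rank(ascending=False, method='min').tolist())
# [1.0, 7.0, 1.0, 3.0, 5.0, 6.0, 3.0]
```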

## Axis Indexes with Duplicate Labels

While many functions (like reindex) require the labels to be **unique**, uniqueness is **not mandatory**  
Indexing a label with multiple entries returns a **Series**, while a label with a single entry returns a **scalar value**

In [2]:
obj = pd.Series(range(5), index = list('aabbc'))
obj

a    0
a    1
b    2
b    3
c    4
dtype: int64

In [3]:
obj['a']

a    0
a    1
dtype: int64

In [4]:
obj['c']

4

The same logic extends to indexing rows in a DataFrame
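
A minimal sketch of the DataFrame case on a made-up frame: a duplicated row label selects several rows, a unique label selects a single row.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(5, 2),
                  index=list('aabbc'), columns=['x', 'y'])

print(df.index.is_unique)          # False
print(type(df.loc['b']).__name__)  # DataFrame (two rows share the label 'b')
print(type(df.loc['c']).__name__)  # Series (the label 'c' occurs once)
```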

## Summarizing and Computing Descriptive Statistics

Pandas objects are equipped with a set of common mathematical and statistical methods

In [7]:
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]], index=list('abcd'), columns=['one', 'two'])
df

    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3


In [9]:
df.sum()
#NaN values are skipped by default; to sum across columns and keep NaN, use df.sum(axis=1, skipna=False)

one    9.25
two   -5.80
dtype: float64
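
A sketch of the `axis` and `skipna` options of `sum`, using the same values as `df` above. The names `row_sums` and `strict` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=list('abcd'), columns=['one', 'two'])

row_sums = df.sum(axis=1)              # NaN is skipped; all-NaN row 'c' sums to 0.0
strict = df.sum(axis=1, skipna=False)  # any NaN in a row makes the row sum NaN

print(row_sums)
print(strict)
```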

In [10]:
df.cumsum()

    one  two
a  1.40  NaN
b  8.50 -4.5
c   NaN  NaN
d  9.25 -5.8


In [12]:
df.describe()
#produces multiple summary statistics in one shot

            one       two
count  3.000000  2.000000
mean   3.083333 -2.900000
std    3.493685  2.262742
min    0.750000 -4.500000
25%    1.075000 -3.700000
50%    1.400000 -2.900000
75%    4.250000 -2.100000
max    7.100000 -1.300000


Methods:
>count  
>idxmin, argmin  
>var  
>pct_change  
>...
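
A few of these methods in action, as a sketch on the same values as `df` above plus a made-up `prices` series:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=list('abcd'), columns=['one', 'two'])

print(df.count().tolist())   # [3, 2] (non-NaN values per column)
print(df.idxmax().tolist())  # ['b', 'd'] (index labels of the column maxima)
print(df.idxmin().tolist())  # ['d', 'b'] (index labels of the column minima)

prices = pd.Series([100.0, 110.0, 99.0])
changes = prices.pct_change()     # fractional change from the previous element
print(changes.round(2).tolist())  # [nan, 0.1, -0.1]
```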

## Correlation and Covariance

In [13]:
import pandas_datareader.data as web

In [14]:
all_data = {ticker: web.get_data_yahoo(ticker)
           for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']}

In [16]:
price = pd.DataFrame({ticker: data['Adj Close']
                      for ticker, data in all_data.items()})

In [17]:
volume = pd.DataFrame({ticker: data['Volume']
                      for ticker, data in all_data.items()})

In [20]:
price.tail()

                  AAPL         IBM        MSFT         GOOG
Date
2020-02-10  321.549988  154.429993  188.699997  1508.680054
2020-02-11  319.609985  153.479996  184.440002  1508.790039
2020-02-12  327.200012  155.309998  184.710007  1518.270020
2020-02-13  324.869995  154.309998  183.710007  1514.660034
2020-02-14  324.950012  150.699997  185.350006  1520.739990


In [18]:
returns = price.pct_change()

In [19]:
returns.tail()

                AAPL       IBM      MSFT      GOOG
Date
2020-02-10  0.004750  0.006649  0.026157  0.019909
2020-02-11 -0.006033 -0.006152 -0.022575  0.000073
2020-02-12  0.023748  0.011923  0.001464  0.006283
2020-02-13 -0.007121 -0.006439 -0.005414 -0.002378
2020-02-14  0.000246 -0.023394  0.008927  0.004014


### Covariance

In [21]:
returns.MSFT.cov(returns.IBM)

8.792434758309941e-05

In [25]:
returns.cov()

          AAPL       IBM      MSFT      GOOG
AAPL  0.000242  0.000079  0.000131  0.000125
IBM   0.000079  0.000170  0.000088  0.000079
MSFT  0.000131  0.000088  0.000208  0.000146
GOOG  0.000125  0.000079  0.000146  0.000227


### Correlation

In [26]:
returns.corr()

          AAPL       IBM      MSFT      GOOG
AAPL  1.000000  0.387847  0.584834  0.531777
IBM   0.387847  1.000000  0.467403  0.404550
MSFT  0.584834  0.467403  1.000000  0.670882
GOOG  0.531777  0.404550  0.670882  1.000000


In [27]:
returns.corrwith(returns.IBM)

AAPL    0.387847
IBM     1.000000
MSFT    0.467403
GOOG    0.404550
dtype: float64
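
The link between the two measures can be checked without downloading any data. This is a sketch on synthetic random numbers, not the stock returns above; the column names `X`, `Y`, `Z` and the seed are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.standard_normal(250)
synthetic = pd.DataFrame({
    'X': base,
    'Y': 0.5 * base + rng.standard_normal(250),  # partly driven by X
    'Z': rng.standard_normal(250),               # independent of X
})

cov_xy = synthetic['X'].cov(synthetic['Y'])
corr_xy = synthetic['X'].corr(synthetic['Y'])

# Pearson correlation is the covariance scaled by both standard deviations
manual = cov_xy / (synthetic['X'].std() * synthetic['Y'].std())
print(np.isclose(corr_xy, manual))  # True

# corrwith against one column reproduces that column of the full matrix
with_x = synthetic.corrwith(synthetic['X'])
print(np.allclose(with_x, synthetic.corr()['X']))  # True
```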

## Unique Values, Value Counts, and Membership

In [30]:
obj = pd.Series(list('cadaabbcc'))
uniques = obj.unique()
uniques

array(['c', 'a', 'd', 'b'], dtype=object)

In [32]:
obj.value_counts()
#Sorted by count in descending order by default

a    3
c    3
b    2
d    1
dtype: int64

In [34]:
mask = obj.isin(list('bc'))
mask

0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool

In [35]:
obj[mask]

0    c
5    b
6    b
7    c
8    c
dtype: object
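
Two small variations, sketched on the same values as `obj` above: relative frequencies via `normalize=True`, and inverting the membership mask with `~`.

```python
import pandas as pd

obj = pd.Series(list('cadaabbcc'))

counts = obj.value_counts()
print(counts['a'], counts['d'])  # 3 1

# normalize=True returns each value's share of the total instead of its count
shares = obj.value_counts(normalize=True)

# ~ inverts the boolean mask: keep everything that is NOT 'b' or 'c'
others = obj[~obj.isin(['b', 'c'])]
print(others.tolist())  # ['a', 'd', 'a', 'a']
```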