In [2]:
import pandas as pd
import numpy as np

## Pandas Series

1. It is like a 1D ndarray, but it also carries data labels (called the index)
2. Unlike an ndarray, you can retrieve values by index label
3. NumPy arithmetic operations and boolean indexing all work on it
4. You can also view it as an ordered dict {index : value}
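
The four points above can be sketched in a few lines. The `prices` data is made up for illustration and is not part of the notebook.

```python
import pandas as pd

# Hypothetical example data, not taken from the notebook.
prices = pd.Series({"apple": 1.5, "pear": 2.0, "plum": 0.5})

# Label lookup works like a dict lookup.
pear_price = prices["pear"]

# Membership tests check the index (the "keys"), not the values.
has_plum = "plum" in prices

# Boolean indexing and arithmetic keep the labels attached.
expensive_doubled = prices[prices > 1.0] * 2

# Round-trip back to a plain dict.
as_dict = prices.to_dict()
```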

In [3]:
obj = pd.Series([4, 5, 5, -4])
obj
# left side is index, right side is value

0    4
1    5
2    5
3   -4
dtype: int64

In [4]:
obj.values
obj.index
# specify the index labels explicitly
obj = pd.Series([2, 3, 4, -7], index=['a', 'b', 'd', 'e'])
obj
# index labels do not have to be unique

a    2
b    3
d    4
e   -7
dtype: int64

In [5]:
obj['a']
obj[['a', 'b', 'e']]
obj[obj > 0] * 2
np.exp(obj)

a     7.389056
b    20.085537
d    54.598150
e     0.000912
dtype: float64

In the cell below, three values found in sdata are placed in the appropriate locations, but since
no value for 'California' is found, it appears as NaN (not a number), which pandas
uses to mark missing or NA values. Since 'Utah' is not included in
states, it is excluded from the resulting object.

In [6]:
'b' in obj
# Create Series from python dict
sdata = {'Ohio': 35000, 'Texax': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Texax', 'Ohio', 'Oregon']
# values are matched to the requested index labels by dict key
obj3 = pd.Series(data=sdata, index=states)

obj3


California        NaN
Texax         71000.0
Ohio          35000.0
Oregon        16000.0
dtype: float64

In [7]:
# use isnull and notnull (aliases: isna and notna) to detect missing values
obj3.isnull()
obj3.notnull()

California    False
Texax          True
Ohio           True
Oregon         True
dtype: bool

A useful Series feature for many applications is that it automatically aligns by index
label in arithmetic operations

In [8]:
obj4 = pd.Series(data=sdata)
obj4 + obj3
# any arithmetic operation involving NaN yields NaN

California         NaN
Ohio           70000.0
Oregon         32000.0
Texax         142000.0
Utah               NaN
dtype: float64
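
If the NaN results for non-overlapping labels are not wanted, one option is the `add` method with `fill_value`. The following is a minimal sketch with made-up Series (`left`, `right`), not the notebook's data.

```python
import pandas as pd

# Hypothetical data for illustration.
left = pd.Series({"Ohio": 35000, "Texas": 71000, "Utah": 5000})
right = pd.Series({"Ohio": 1000, "Oregon": 16000})

# Plain +: labels present in only one operand become NaN.
plain = left + right

# add with fill_value: the missing side is treated as 0 before adding.
filled = left.add(right, fill_value=0)
```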

In [9]:
# a Series and its index each have a name attribute
obj4.name = 'population'
obj4.index.name = 'state'
obj4
# the index can be altered in place by assignment
obj3.index = [1, 2, 3, 4]
obj3

1        NaN
2    71000.0
3    35000.0
4    16000.0
dtype: float64

## DataFrame

1. A tabular data structure
2. An ordered collection of columns, each of which can have a different data type
3. It has both a row index and a column index
4. Can be thought of as a dict of Series all sharing the same index
5. Internally, the data is stored as one or more two-dimensional blocks

### Create a DataFrame

1. 2D ndarray
2. dict of arrays, lists, or tuples
3. NumPy structured/record array
4. dict of Series
5. dict of dicts
6. list of dicts or Series
7. list of lists or tuples
8. another DataFrame
9. NumPy masked array
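
A few of the input types listed above, sketched with small made-up inputs (the variable names are illustrative only):

```python
import numpy as np
import pandas as pd

# 2D ndarray with explicit row and column labels.
from_array = pd.DataFrame(np.arange(6).reshape(2, 3),
                          index=["r1", "r2"], columns=["a", "b", "c"])

# List of dicts: each dict is one row, the keys become columns.
from_records = pd.DataFrame([{"a": 1, "b": 2}, {"a": 3, "c": 4}])

# Dict of Series: the indexes are unioned, gaps become NaN.
from_series = pd.DataFrame({"x": pd.Series([1, 2], index=["p", "q"]),
                            "y": pd.Series([3], index=["q"])})
```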

In [10]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year' : [2000, 2001, 2002, 2001, 2002],
        'pop' : [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = pd.DataFrame(data)
frame
# A DataFrame gets a default integer index just like a Series. Columns keep the dict's
# insertion order (older pandas versions sorted them alphabetically)
# If you specify the columns, they are arranged in that order
pd.DataFrame(data, columns=['year', 'pop', 'state'])
# If a requested column is not contained in the dict, it appears filled with missing
# values
frame2 = pd.DataFrame(data, columns=['year', 'pop', 'state', 'debt'])
# A column can be retrieved as a Series by dict-like notation or by attribute (same result)
frame['year']
frame.year
# the Series name is set to the column label

0    2000
1    2001
2    2002
3    2001
4    2002
Name: year, dtype: int64

In [11]:
# Rows can be retrieved by label with the loc attribute (or by integer position with iloc)
frame.index = ['one', 'two', 'three', 'four', 'five']
frame.loc['one']
# set column value by assignment
frame2.index = ['one', 'two', 'three', 'four', 'five']
frame2.debt = 100
frame2

Unnamed: 0,year,pop,state,debt
one,2000,1.5,Ohio,100
two,2001,1.7,Ohio,100
three,2002,3.6,Ohio,100
four,2001,2.4,Nevada,100
five,2002,2.9,Nevada,100


In [12]:
# If you assign an array or list to a DataFrame column, its length must match the DataFrame's length!
# If you assign a Series, its labels are realigned exactly to the DataFrame index,
# and any unmatched positions become NaN
val = pd.Series([-1.2, -1.5, 1.7], index=['two', 'one', 'four'])
frame2.debt = val
frame2
# Assigning a column that doesn't exist creates a new column
# (new columns cannot be created with the frame2.eastern attribute syntax)
frame2['eastern'] = frame2.state == 'Ohio'
frame2
# del deletes the column
del frame2['eastern']
frame2


Unnamed: 0,year,pop,state,debt
one,2000,1.5,Ohio,-1.5
two,2001,1.7,Ohio,-1.2
three,2002,3.6,Ohio,
four,2001,2.4,Nevada,1.7
five,2002,2.9,Nevada,


#### The column returned from indexing a DataFrame may be a view on the underlying data, not a copy
#### In that case changes are reflected in the original data (with Copy-on-Write, the default from pandas 3.0, they are not); use the copy method when you need an independent object
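
Whether a selected column behaves as a view depends on the pandas version and its Copy-on-Write setting, so the sketch below only shows the version-independent part: an explicit `copy` never writes through to the frame. The small `frame` here is made up for the example.

```python
import pandas as pd

frame = pd.DataFrame({"year": [2000, 2001], "pop": [1.5, 1.7]})

# An explicit copy is independent of the frame in every pandas version.
pop_copy = frame["pop"].copy()
pop_copy.iloc[0] = 99.0

# The original column is untouched.
original_first = frame.loc[0, "pop"]
```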

In [13]:
# Another way to create dataframe is by nested dict
pop = {'Nevada' : {2001 : 2.4, 2002: 2.9},
       'Ohio' : {2000 : 1.5, 2001 : 1.7, 2002: 3.6}}
frame3 = pd.DataFrame(pop)
frame3
# pandas interprets the outer dict keys as columns and the inner keys as the row labels;
# the inner keys of all dicts are combined to form the index
# T transposes rows and columns
frame3.T
# Or you can pass an explicit index
# frame3.index = [2001, 2002, 2003]
index_pop = pd.Series([2001, 2002, 2003])
pd.DataFrame(pop, index=index_pop)

Unnamed: 0,Nevada,Ohio
2001,2.4,1.7
2002,2.9,3.6
2003,,


In [14]:
# use Series as dict values
pdate = {'Ohio' : frame3['Ohio'][:-1],
         'Nevada' : frame3['Nevada'][:2]}
pd.DataFrame(pdate)
# if the index and columns have their name attributes set, the names are displayed too
frame3.index.name = 'year'
frame3.columns.name = 'state'
frame3.values # returns the data as a two-dimensional ndarray
frame2.values # if the columns have different dtypes, a dtype that accommodates all of them is chosen (here object)

array([[2000, 1.5, 'Ohio', -1.5],
       [2001, 1.7, 'Ohio', -1.2],
       [2002, 3.6, 'Ohio', nan],
       [2001, 2.4, 'Nevada', 1.7],
       [2002, 2.9, 'Nevada', nan]], dtype=object)

### Index object

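1. pd.Index is the most general Index object; it holds the axis labels in a NumPy array of Python objects
2. pd.Int64Index is a specialized Index for integer values (removed in pandas 2.0, where pd.Index holds integer dtypes directly)
3. pd.MultiIndex is a hierarchical index with multiple levels of labels on a single axis
4. DatetimeIndex stores nanosecond timestamps using NumPy's datetime64 dtype
5. PeriodIndex is a specialized Index for time periods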

#### Index Function
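1. append (concatenate with another Index, producing a new Index)
2. difference (compute the set difference with another Index)
3. intersection
4. union
5. isin
6. delete (new Index with the element at position i deleted)
7. drop (new Index with the passed labels deleted)
8. insert (new Index with an element inserted at position i)
9. is_monotonic_increasing (True if each element is greater than or equal to the previous one)
10. is_unique
11. unique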
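
A short sketch of several of these methods on two made-up indexes (`left` and `right` are illustrative names). Each call returns a new object, since an Index is immutable.

```python
import pandas as pd

left = pd.Index(["a", "b", "c"])
right = pd.Index(["b", "c", "d"])

both = left.intersection(right)     # labels present in both
either = left.union(right)          # labels present in at least one
only_left = left.difference(right)  # labels in left but not in right
mask = left.isin(["a", "c"])        # boolean array, one entry per label
longer = left.insert(1, "z")        # new Index with "z" at position 1
shorter = left.delete(0)            # new Index without position 0
```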

In [15]:
obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index
# index[1] = 2 raises a TypeError
# Index objects are immutable, so they can be shared safely among data structures
index = pd.Index(range(3))
obj2 = pd.Series(range(1, 4), index=index)
obj2.index is index

True

In [16]:
# An Index also behaves like a fixed-size set
'Ohio' in frame3.columns
2002 in frame3.index
# Unlike a Python set, an Index can contain duplicate labels

True

## Essential Functionality

### Reindex
1. Creates a new object with the data conformed to a new index
2. If an index label is not already present, its value is filled with NaN, or with a value you choose via fill_value
3. For ordered data like time series, you may want to fill in values when reindexing
4. The method option does this
    1. ffill or pad carries the last valid value forward (forward fill)
    2. bfill or backfill carries the next valid value backward (backward fill)

In [17]:
obj = pd.DataFrame([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2
obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)

Unnamed: 0,0
a,-5.3
b,7.2
c,3.6
d,4.5
e,0.0


In [18]:
obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3.reindex(range(6), method='ffill')

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

In [19]:
frame = pd.DataFrame(np.arange(9).reshape(3,3), index=['a', 'c', 'd'], 
                     columns=['Ohio', 'Texas', 'California'])
frame2 = frame.reindex(['a', 'b', 'c', 'd'])

In [20]:
# the column can be reindexed with columns keyword
states = ['Texax', 'Utah', 'California']
frame.reindex(columns=states)

Unnamed: 0,Texax,Utah,California
a,,,2
c,,,5
d,,,8


In [21]:
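# older pandas could reindex through label-indexing with loc (note the warning in the output);
# since pandas 1.0 a missing label raises KeyError, so use reindex instead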
frame.loc[['a', 'b', 'c', 'd'], states]

Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
  return self._getitem_tuple(key)


Unnamed: 0,Texax,Utah,California
a,,,2.0
b,,,
c,,,5.0
d,,,8.0
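
As the warning suggests, `reindex` is the alternative when some labels may be missing. A minimal sketch that rebuilds the same `frame` and reindexes both axes in one call (the label lists here use the correctly spelled 'Texas', unlike the `states` list above):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape(3, 3), index=["a", "c", "d"],
                     columns=["Ohio", "Texas", "California"])

# Missing row label "b" and missing column "Utah" are filled with NaN.
result = frame.reindex(index=["a", "b", "c", "d"],
                       columns=["Texas", "Utah", "California"])
```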


### Dropping entries from an Axis
1. The drop method returns a new object with the indicated value or values deleted from an axis

In [22]:
obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
obj2 = obj.drop('c')
new_object = obj.drop(['c', 'a'])
new_object
frame = pd.DataFrame(np.arange(9).reshape(3,3), index=['a', 'c', 'd'], 
                     columns=['Ohio', 'Texas', 'California'])
# for a DataFrame, index values can be deleted from either axis
frame.drop('Ohio', axis=1) # axis defaults to 0 (rows), so pass axis=1 to drop a column
frame.drop('Ohio', axis='columns')
# inplace=True modifies the object itself and returns None instead of a new object
frame.drop('Ohio', axis=1, inplace=True)
frame

Unnamed: 0,Texas,California
a,1,2
c,4,5
d,7,8


### Indexing, selection and filtering
1. df[val] selects a single column or a sequence of columns
2. df.loc[val] selects a single row or a subset of rows by label
3. df.loc[:, val] selects a single column or a subset of columns by label
4. df.loc[val1, val2] selects both rows and columns by label
5. df.iloc[where] selects a single row or a subset of rows by integer position
6. df.iloc[:, where] selects a single column or a subset of columns by integer position
7. df.iloc[where_i, where_j] selects both rows and columns by integer position
8. df.at[label_i, label_j] selects a single scalar value by row and column label
9. df.iat[i, j] selects a single scalar value by row and column integer position
10. get_value, set_value select or set a single value by row and column label (deprecated in pandas 0.21 and later removed; use at and iat instead)
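
The selectors in the list above, applied to a small frame shaped like the one used later in this section. This is a sketch; the result names are made up.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(9).reshape(3, 3), index=["a", "c", "d"],
                    columns=["Ohio", "Texas", "California"])

col = data["Texas"]                   # one column as a Series
row = data.loc["c"]                   # one row by label
block = data.iloc[:2, [0, 2]]         # first two rows, columns 0 and 2 by position
cell_by_label = data.at["d", "Ohio"]  # scalar by labels
cell_by_pos = data.iat[2, 0]          # the same scalar by positions

# at can also set a single value; done on a copy to leave data unchanged.
updated = data.copy()
updated.at["a", "Ohio"] = 100
```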

In [23]:
# Series indexing works like ndarray indexing, but you can also use index labels
obj = pd.Series(range(4), index=['a', 'b', 'c', 'd'])
obj[['a', 'c']]
obj[obj < 2]
# if you slice with labels, the end point is inclusive!
obj['b':'d']
obj['b':'d'] = -1
obj

a    0
b   -1
c   -1
d   -1
dtype: int64

In [24]:
# for a DataFrame, indexing with [] retrieves one or more columns
data = pd.DataFrame(np.arange(9).reshape(3,3), index=['a', 'c', 'd'], 
                     columns=['Ohio', 'Texas', 'California'])
data[['Ohio', 'Texas']]
# special case: slicing with [] selects rows, not columns
data[:1]
# special case: a boolean array inside [] filters rows
data[data['Ohio'] > 1]

Unnamed: 0,Ohio,Texas,California
c,3,4,5
d,6,7,8


#### Selection with loc and iloc
1. loc selects by axis labels
2. iloc selects by integer position

In [25]:
data.loc['a', ['Ohio', 'Texas']]
# select a single row with multiple columns

Ohio     0
Texas    1
Name: a, dtype: int64

In [26]:
data.iloc[0, [0, 1]]

Ohio     0
Texas    1
Name: a, dtype: int64

In [27]:
# both work with slices (loc includes the end label, iloc excludes the end position)
data.loc[:'c', 'Ohio']
data.iloc[:2, :2]

Unnamed: 0,Ohio,Texas
a,0,1
c,3,4


If an axis index contains integers, data selection is always label-oriented; use loc for labels and iloc for integer positions to be explicit.
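
A small sketch of why being explicit matters: with the made-up Series below, the integer labels run opposite to the positions, so `loc` and `iloc` return different elements for the same number.

```python
import pandas as pd

# Integer labels in reverse order of position.
ser = pd.Series([10.0, 20.0, 30.0], index=[2, 1, 0])

by_label = ser.loc[0]       # label 0 is the last element
by_position = ser.iloc[0]   # position 0 is the first element

label_slice = ser.loc[2:1]  # label slices include both end points
pos_slice = ser.iloc[0:1]   # position slices exclude the end point
```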

### Arithmetic and Data Alignment

1. An important pandas feature for some applications is the behavior of arithmetic
between objects with different indexes. When you add objects together, if any
index pairs are not the same, the index of the result is the union of the
index pairs.
2. Label locations that do not overlap are filled with the missing value NaN
3. Missing values then propagate in further arithmetic operations
4. For DataFrames, alignment is performed on both the rows and the columns

#### Basic Methods
Each of them has a counterpart, starting with the letter r, that has its arguments flipped.

1. add and radd
2. sub and rsub
3. div and rdiv
4. floordiv and rfloordiv
5. mul and rmul
6. pow and rpow

In [28]:
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])
s1 + s2

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

In [29]:
# Dataframe alignment
# the index and columns are the union of the ones in each dataframe
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                   index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])
df1 + df2

Unnamed: 0,b,c,d,e
Colorado,,,,
Ohio,3.0,,6.0,
Oregon,,,,
Texas,9.0,,12.0,
Utah,,,,


In [30]:
df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'B': [1, 2]})
df1 + df2
"""
If you add DataFrame objects with no column or row labels in common, 
the result will contain all nulls
"""

'\nIf you add DataFrame objects with no column or row labels in common, \nthe result will contain all nulls\n'

In [31]:
# use fill_value to fill the NaN position
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                   index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])
df1.add(df2, fill_value=0)
"""
If data in both corresponding DataFrame locations is missing
    the result will be missing
"""
df1.reindex(columns=df2.columns, fill_value=0)

Unnamed: 0,b,d,e
Ohio,0.0,2.0,0
Texas,3.0,5.0,0
Colorado,6.0,8.0,0


In [32]:
1 / df1
df1.rdiv(1)
# the two lines above are equivalent
df1 * 5
df1.rpow(5)
# rpow(5) computes 5 ** df1, which is not the same as df1 * 5

Unnamed: 0,b,c,d
Ohio,1.0,5.0,25.0
Texas,125.0,625.0,3125.0
Colorado,15625.0,78125.0,390625.0
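
The "flipped arguments" rule can be checked directly. This sketch uses a tiny made-up frame `df` rather than `df1`.

```python
import pandas as pd

df = pd.DataFrame({"b": [1.0, 2.0], "c": [4.0, 5.0]})

# r-methods put the DataFrame on the right-hand side of the operator.
same_div = (1 / df).equals(df.rdiv(1))    # 1 / df
same_pow = (5 ** df).equals(df.rpow(5))   # 5 ** df
same_sub = (10 - df).equals(df.rsub(10))  # 10 - df

# The non-reversed method keeps the DataFrame on the left.
halves = df.floordiv(2)                   # df // 2
```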


##### Operations between DataFrame and Series

In [33]:
arr = np.arange(12.).reshape((3, 4))
arr - arr[0]
"""
This is broadcasting: the subtraction is performed once for each row. Operations between
a DataFrame and a Series are similar
"""
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.iloc[0]
"""
arithmetic between DataFrame and Series matches the index of the Series
on the DataFrame’s columns, broadcasting down the rows:
"""
frame - series

Unnamed: 0,b,d,e
Utah,0.0,0.0,0.0
Ohio,3.0,3.0,3.0
Texas,6.0,6.0,6.0
Oregon,9.0,9.0,9.0


In [34]:
"""
If an index value is not found in either the DataFrame’s columns or the Series’s index,
the objects will be reindexed to form the union:
"""
series = pd.Series(range(3), index=['b', 'e', 'f'])
frame - series
"""
If you want to broadcast over the columns instead, matching on the rows, you have to use
one of the arithmetic methods (here sub)
"""
series = frame['d']
frame.sub(series, axis='index') # axis is the axis you want to match on

Unnamed: 0,b,d,e
Utah,-1.0,0.0,1.0
Ohio,-1.0,0.0,1.0
Texas,-1.0,0.0,1.0
Oregon,-1.0,0.0,1.0


## Function Application and Mapping
1. NumPy ufuncs (element-wise array functions) also work with pandas objects

In [35]:
# generate values from a standard normal distribution
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'), 
                    index=['Utah', 'Ohio', 'Texas', 'Oregon']) 
np.abs(frame)
# another frequent operation is applying a function on one-dimensional arrays to each column or row
frame

Unnamed: 0,b,d,e
Utah,0.519472,-1.220424,0.526148
Ohio,-0.971905,2.038744,0.079011
Texas,1.389339,0.273197,-0.654354
Oregon,-0.856054,-0.928078,-0.424623


In [36]:
f = lambda x : x.max() - x.min()
# axis=1 (or 'columns') calls f once per row; the default axis=0 calls it once per column
frame.apply(f, axis=1)
frame.apply(f)

b    2.361244
d    3.259168
e    1.180502
dtype: float64
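
Because the frame above is random, the effect of `axis` is easier to verify on fixed numbers. A sketch with made-up values:

```python
import pandas as pd

frame = pd.DataFrame({"b": [1.0, 4.0], "d": [2.0, 8.0], "e": [3.0, 6.0]},
                     index=["Utah", "Ohio"])

def spread(x):
    # x is one column (axis=0) or one row (axis=1), passed as a Series.
    return x.max() - x.min()

per_column = frame.apply(spread)        # one result per column label
per_row = frame.apply(spread, axis=1)   # one result per row label
```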

In [37]:
"""
The function passed to apply need not return a scalar value; it can also return a Series 
with multiple values
"""
def f(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f, axis=1)

Unnamed: 0,min,max
Utah,-1.220424,0.526148
Ohio,-0.971905,2.038744
Texas,-0.654354,1.389339
Oregon,-0.928078,-0.424623


In [38]:
"""
Element-wise Python functions can be used, too. Suppose you wanted to compute a
formatted string from each floating-point value in frame .
"""
format_f = lambda x : '%.2f' % x
# applymap was renamed to DataFrame.map in pandas 2.1, where applymap is deprecated
frame.applymap(format_f)
"""
The reason for the name applymap is that Series has a map method for applying an
element-wise function:
"""
frame['e'].map(format_f)

Utah       0.53
Ohio       0.08
Texas     -0.65
Oregon    -0.42
Name: e, dtype: object
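
Since `applymap` is deprecated in recent pandas releases in favor of `DataFrame.map`, code that has to run on both old and new versions can pick whichever exists. This is one possible sketch, with made-up values, not the only way to handle it.

```python
import pandas as pd

frame = pd.DataFrame({"b": [1.5, -0.25], "e": [0.126, 2.0]},
                     index=["Utah", "Ohio"])

def format_f(x):
    return "%.2f" % x

# DataFrame.map exists from pandas 2.1 on; fall back to applymap before that.
if hasattr(pd.DataFrame, "map"):
    formatted = frame.map(format_f)
else:
    formatted = frame.applymap(format_f)

# Series.map applies the same element-wise function to a single column.
formatted_e = frame["e"].map(format_f)
```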

## Sorting and Ranking
1. Use the sort_index method to sort lexicographically by row or column index; it returns a new, sorted object

In [39]:
obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
# sort by index labels (inplace=True modifies obj instead of returning a new object)
obj.sort_index(inplace=True)
obj

a    1
b    2
c    3
d    0
dtype: int64

In [40]:
# For dataframe, you can sort by index on either axis
frame = pd.DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'],
                     columns=['d', 'a', 'b', 'c'])
frame

Unnamed: 0,d,a,b,c
three,0,1,2,3
one,4,5,6,7


In [41]:
frame.sort_index()
frame.sort_index(axis=1, ascending=False) # sort the column labels in descending order

Unnamed: 0,d,c,b,a
three,0,3,2,1
one,4,7,6,5


In [42]:
# to sort a Series by its values, use its sort_values method
obj = pd.Series([4, 7, -3, 2])
obj.sort_values()
# missing values are sorted to the end of the Series by default
obj = pd.Series([4, np.nan, 7, -3, 2, np.nan])
obj.sort_values()

3   -3.0
4    2.0
0    4.0
2    7.0
1    NaN
5    NaN
dtype: float64

In [43]:
"""
When sorting a DataFrame, you can use the data in one or more columns as the sort keys;
pass them to the by option of sort_values
"""
frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
frame.sort_values(by=['a', 'b'])

Unnamed: 0,b,a
2,-3,0
0,4,0
3,2,1
1,7,1


#### Ranking

1. Ranking assigns ranks from 1 to the number of valid data points in an array.
The rank methods for Series and DataFrame are the place to look; by default rank
breaks ties by assigning each group the mean rank.

In [44]:
obj = pd.Series([7, -5, 7, 4, 2, 0 ,4])
obj.rank()
# ranks can also be assigned according to the order in which the values are observed in the data,
# which breaks ties
obj.rank(method='first') # the default method is 'average'
obj.rank(method='max') # use the maximum rank for the whole group
obj.rank(method='min') # use the minimum rank for the whole group
obj.rank(method='dense') # like 'min', but ranks always increase by 1 between groups rather than by the group size


0    5.0
1    1.0
2    5.0
3    4.0
4    3.0
5    2.0
6    4.0
dtype: float64
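
Only the last call's result is displayed above. To compare the tie-breaking methods side by side, one option is to collect them into a DataFrame (the name `ranks` is made up for this sketch):

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# One column per tie-breaking method, same row index as obj.
methods = ["average", "min", "max", "first", "dense"]
ranks = pd.DataFrame({m: obj.rank(method=m) for m in methods})
```

The tied values (the two 7s and the two 4s) are the rows where the columns differ.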

In [45]:
# A DataFrame can compute ranks over the rows or over the columns
frame = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1], 'c': [-2, 5, 8, -2.5]})
frame

Unnamed: 0,b,a,c
0,4.3,0,-2.0
1,7.0,1,5.0
2,-3.0,0,8.0
3,2.0,1,-2.5


In [46]:
frame.rank()

Unnamed: 0,b,a,c
0,3.0,1.5,2.0
1,4.0,3.5,3.0
2,1.0,1.5,4.0
3,2.0,3.5,1.0


In [47]:
frame.rank(axis=1)

Unnamed: 0,b,a,c
0,3.0,2.0,1.0
1,3.0,1.0,2.0
2,1.0,2.0,3.0
3,3.0,2.0,1.0


## Axis indexes with Duplicate Labels

In [48]:
obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj.index.is_unique
"""
Data selection is one of the main things that behaves differently with duplicates.
Indexing a label with multiple entries returns a Series, while single entries return a
scalar value
"""
obj['a']
# the same logic extends to indexing rows in a DataFrame
df = pd.DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])
df.loc['a']

Unnamed: 0,0,1,2
a,-1.692585,0.750158,-1.513939
a,0.858648,1.824047,-2.816745


## Summarizing and Computing Descriptive Statistics
1. pandas objects are equipped with a set of common mathematical and statistical methods.
Most of these fall into the category of reductions or summary statistics, methods
that extract a single value (like the sum or mean) from a Series or a Series of values
from the rows or columns of a DataFrame. Compared with the similar methods
found on NumPy arrays, they have built-in handling for missing data
2. count: number of non-NA values
3. describe: compute a set of summary statistics for a Series or each DataFrame column
4. min, max: minimum and maximum values
5. argmin, argmax: integer positions at which the minimum or maximum value is found
6. idxmin, idxmax: index labels at which the minimum or maximum value is found
7. quantile: sample quantile ranging from 0 to 1
8. sum
9. mean
10. median
11. mad: mean absolute deviation from the mean value (removed in pandas 2.0)
12. prod (alias product): product of all values
13. var: sample variance of values
14. std: sample standard deviation of values
15. skew: sample skewness (third moment) of values
16. kurt: sample kurtosis (fourth moment) of values
17. cumsum: cumulative sum of values
18. cummin, cummax: cumulative minimum or maximum of values
19. cumprod: cumulative product of values
20. diff: compute first arithmetic difference (useful for time series)
21. pct_change: compute percent changes
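
Two items from the list are worth a quick sketch: the difference between `idxmax` (label) and `argmax` (position), and a hand-written mean absolute deviation for pandas versions where the `mad` method is no longer available. The Series `s` is made up for the example.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0], index=list("wxyz"))

label_of_max = s.idxmax()     # index label of the maximum
position_of_max = s.argmax()  # integer position of the maximum

# Mean absolute deviation around the mean, written out explicitly.
mean_abs_dev = (s - s.mean()).abs().mean()

running_total = s.cumsum()    # accumulation, same length as s
```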

In [49]:
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=list('abcd'), columns=['one', 'two'])
df

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3


In [50]:
#Calling DataFrame’s sum method returns a Series containing column sums:
df.sum()
#Passing axis='columns' or axis=1 sums across the columns instead:
#axis to reduce over
df.sum(axis=1)
"""
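NA values are excluded unless the entire slice (row or column in this case) is NA;
since pandas 0.22 the sum of an all-NA slice is 0, while mean still returns NaN.
Excluding NA values can be disabled with the skipna option
skipna: exclude missing values; the default is True
level: reduce grouped by level if the axis is hierarchically indexed (MultiIndex);
       removed in pandas 2.0 in favor of groupby(level=...)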
"""
df.sum(axis=1, skipna=False)
df.mean(axis=1)

a    1.400
b    1.300
c      NaN
d   -0.275
dtype: float64
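
How an all-NA row is summed depends on `skipna` and, in pandas 0.22 and later, on the `min_count` option. The sketch below rebuilds the same `df` and compares the three variants.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=list("abcd"), columns=["one", "two"])

default_sum = df.sum(axis=1)               # NaN skipped; all-NA row "c" sums to 0
strict_sum = df.sum(axis=1, skipna=False)  # any NaN in a row makes the result NaN
needs_one = df.sum(axis=1, min_count=1)    # require at least one valid value per row
```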

In [51]:
# idxmin and idxmax return indirect statistics: the index label where the minimum or maximum is attained
df.idxmax()
df.idxmin()

one    d
two    b
dtype: object

In [52]:
# other methods are accumulations
df.cumsum()

Unnamed: 0,one,two
a,1.4,
b,8.5,-4.5
c,,
d,9.25,-5.8


In [53]:
"""
Another type of method is neither a reduction nor an accumulation. describe is one
such example, producing multiple summary statistics in one shot
"""
df.describe()

Unnamed: 0,one,two
count,3.0,2.0
mean,3.083333,-2.9
std,3.493685,2.262742
min,0.75,-4.5
25%,1.075,-3.7
50%,1.4,-2.9
75%,4.25,-2.1
max,7.1,-1.3


In [54]:
# for non-numeric data, describe produces alternative summary statistics
obj = pd.DataFrame(list('aabc') * 4)
obj.describe()

Unnamed: 0,0
count,16
unique,3
top,a
freq,8


## Correlation and Covariance
1. Some summary statistics, like correlation and covariance, are computed from pairs of
arguments.

In [57]:
import pandas_datareader.data as web

In [59]:
all_data = {ticker : web.get_data_yahoo(ticker) for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']}

In [62]:
price = pd.DataFrame({ticker : data['Adj Close'] for ticker, data in all_data.items()})
volume = pd.DataFrame({ticker : data['Volume'] for ticker, data in all_data.items()})

In [66]:
# pct_change computes the percent change between consecutive rows;
# time series are covered in more detail in chapter 10
returns = price.pct_change()
returns.tail()

Unnamed: 0_level_0,AAPL,IBM,MSFT,GOOG
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2019-03-04,0.005029,-0.005532,-0.002399,0.005969
2019-03-05,-0.00182,-0.003973,-0.004988,0.012398
2019-03-06,-0.005754,-0.006527,0.000448,-0.003589
2019-03-07,-0.011575,-0.011827,-0.01217,-0.012575
2019-03-08,0.002377,-0.001995,0.001087,-0.000857


In [70]:
"""
The corr method of Series computes the correlation of the overlapping, non-NA,
aligned-by-index values in two Series. Relatedly, cov computes the covariance
"""
returns['AAPL'].corr(returns['MSFT'])
returns['AAPL'].cov(returns['MSFT'])
# since AAPL and MSFT are valid Python identifiers, the columns can also be selected as attributes
returns.AAPL.corr(returns.MSFT)

0.4503744085572097

In [76]:
"""DataFrame’s corr and cov methods, on the other hand, return a full correlation or
covariance matrix as a DataFrame"""
returns.cov()
returns.corr()

Unnamed: 0,AAPL,IBM,MSFT,GOOG
AAPL,1.0,0.372275,0.450374,0.457556
IBM,0.372275,1.0,0.48689,0.408303
MSFT,0.450374,0.48689,1.0,0.537389
GOOG,0.457556,0.408303,0.537389,1.0
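
The cells above need network access through pandas_datareader. For an offline check of what `corr`, `cov`, and `corrwith` return, here is a sketch on synthetic "returns" (made-up numbers, not market data) where the expected correlations are known by construction.

```python
import pandas as pd

base = [0.01, 0.02, -0.01, 0.03]
returns = pd.DataFrame({
    "A": base,
    "B": [2 * x for x in base],  # perfectly correlated with A
    "C": [-x for x in base],     # perfectly anti-correlated with A
})

pair_corr = returns["A"].corr(returns["B"])  # scalar
corr_matrix = returns.corr()                 # 3 x 3 DataFrame
cov_matrix = returns.cov()                   # 3 x 3 DataFrame
against_a = returns.corrwith(returns["A"])   # one value per column
```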


In [80]:
"""Using DataFrame’s corrwith method, you can compute pairwise correlations
between a DataFrame’s columns or rows with another Series or DataFrame. Passing a
Series returns a Series with the correlation value computed for each column"""
returns.corrwith(returns.AAPL)
"""
Passing a DataFrame computes the correlations of matching column names.
Passing axis=1 (or axis='columns') does things row-by-row instead.
In all cases, the data points are aligned by label before the correlation is computed!
"""
returns.corrwith(volume, axis=1)

Date
2009-12-31         NaN
2010-01-04    0.791554
2010-01-05    0.737339
2010-01-06    0.016912
2010-01-07    0.507373
2010-01-08   -0.779514
2010-01-11   -0.351055
2010-01-12   -0.288882
2010-01-13    0.886223
2010-01-14   -0.492756
2010-01-15   -0.290288
2010-01-19    0.903180
2010-01-20    0.281733
2010-01-21   -0.695777
2010-01-22   -0.247707
2010-01-25    0.801919
2010-01-26    0.866182
2010-01-27    0.793799
2010-01-28   -0.903431
2010-01-29   -0.965123
2010-02-01    0.102712
2010-02-02    0.407172
2010-02-03    0.411526
2010-02-04   -0.943249
2010-02-05    0.890386
2010-02-08   -0.362185
2010-02-09    0.452406
2010-02-10   -0.320464
2010-02-11    0.833539
2010-02-12    0.597527
                ...   
2019-01-25    0.433705
2019-01-28   -0.268018
2019-01-29   -0.601270
2019-01-30    0.821609
2019-01-31   -0.732114
2019-02-01   -0.371848
2019-02-04    0.851824
2019-02-05    0.778314
2019-02-06    0.365361
2019-02-07   -0.016718
2019-02-08    0.411706
2019-02-11   -0.939367
2019-0

## Unique values, value counts and membership

In [82]:
obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
# The first function is unique, which gives you an array of the unique values in a Series:
uniques = obj.unique()

In [93]:
uniques.sort()
uniques
# Relatedly, value_counts computes a Series containing value frequencies
obj.value_counts()
"""The Series is sorted by count in descending order as a convenience. value_counts is
also available as a top-level pandas function that can be used with any array or
sequence (the top-level pd.value_counts is deprecated since pandas 2.1)"""
pd.value_counts(obj.values, ascending=False)

c    3
a    3
b    2
d    1
dtype: int64

In [109]:
"""isin performs a vectorized set membership check and can be useful in filtering a
dataset down to a subset of values in a Series or column in a DataFrame"""
mask = obj.isin(['b', 'd'])
obj[mask]
"""Related to isin is the Index.get_indexer method, which gives you an index array
from an array of possibly non-distinct values into another array of distinct values:"""
to_match = pd.Series(['c', 'a', 'b', 'b', 'c', 'a'])
unique_vals = pd.Series(['c', 'a', 'b'])
index = pd.Index(unique_vals)
index.get_indexer(to_match)
"""Integers from 0 to n - 1 indicating that the index at these positions matches the 
corresponding target values. Missing values in the target are marked by -1."""

array([0, 1, 2, 2, 0, 1])
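
In the cell above every target value is present in the index, so no -1 appears. A minimal sketch with one missing value (the made-up label "z"):

```python
import pandas as pd

distinct = pd.Index(["c", "a", "b"])

# "z" is not in the index, so its position is reported as -1.
positions = distinct.get_indexer(["a", "z", "c"])
```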

In [114]:
data = pd.DataFrame({'Qu1' : [1, 3, 4, 3, 4],
                     'Qu2' : [2, 3, 1, 2, 3],
                     'Qu3' : [1, 5, 2, 4, 4]})
data.apply(pd.value_counts).fillna(0)

Unnamed: 0,Qu1,Qu2,Qu3
1,1.0,1.0,1.0
2,0.0,2.0,1.0
3,2.0,2.0,0.0
4,2.0,0.0,2.0
5,0.0,0.0,1.0


## Handling the missing value
1. Python's None is also treated as NA in object arrays
2. dropna filters axis labels based on whether the values for each label have missing data,
    with varying thresholds for how much missing data to tolerate
3. fillna fills in missing data with some value or using an interpolation method 
    such as 'ffill' or 'bfill'

In [119]:
string_data = pd.Series(['asda', 'dasda', np.nan, 'sdasdsa'])
string_data.isnull()
string_data.notnull()
# string_data[0] = None
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

In [121]:
# ffill() is equivalent to fillna(method='ffill'), which is deprecated since pandas 2.1
string_data.ffill()

0       asda
1      dasda
2      dasda
3    sdasdsa
dtype: object

In [123]:
"""While you always have the option to
do it by hand using pandas.isnull and boolean indexing, the dropna can be helpful.
On a Series, it returns the Series with only the non-null data and index values
"""
string_data.dropna()
# equivalent to 
string_data[string_data.notnull()]

0       asda
1      dasda
3    sdasdsa
dtype: object

In [131]:
"""With DataFrame objects, things are a bit more complex. You may want to drop rows
or columns that are all NA or only those containing any NAs. dropna by default drops
any row containing a missing value"""
data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan], [np.nan, np.nan, np.nan], 
                     [np.nan, 6.5, 3.]])
cleaned = data.dropna()
cleaned
# passing how='all' drops only rows that are all NA
# (the default is how='any')
data.dropna(how='all')
# to drop columns in the same way, pass axis=1
data.dropna(axis=1, how='all')


Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


In [132]:
"""A related way to filter out DataFrame rows tends to concern time series data. Suppose
you want to keep only rows containing a certain number of observations. You can
indicate this with the thresh argument"""
df = pd.DataFrame(np.random.randn(7, 3))
df.iloc[:4, 1] = np.nan
df.iloc[:2, 2] = np.nan
df

Unnamed: 0,0,1,2
0,0.59489,,
1,-1.750481,,
2,0.057361,,0.7484
3,1.66155,,0.451061
4,-0.385423,1.660343,0.647623
5,1.315059,0.290741,-0.903033
6,1.372561,0.665482,0.310616


In [142]:
# thresh is the minimum number of non-NA values a row needs in order to be kept
df.dropna(thresh=2)

Unnamed: 0,0,1,2
2,0.057361,,0.7484
3,1.66155,,0.451061
4,-0.385423,1.660343,0.647623
5,1.315059,0.290741,-0.903033
6,1.372561,0.665482,0.310616
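
Because the frame above is random, here is the same idea on fixed, made-up values so the kept rows and columns can be checked by eye.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0],
                   "y": [np.nan, np.nan, 5.0],
                   "z": [np.nan, 7.0, 8.0]})

kept = df.dropna(thresh=2)               # rows with at least 2 non-NA values
complete = df.dropna()                   # rows with no NA at all
full_cols = df.dropna(axis=1, thresh=3)  # columns with at least 3 non-NA values
```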


## Filling in missing data
1. Rather than filtering out missing data (and potentially discarding other data along
with it), you may want to fill in the “holes” in any number of ways. For most purposes,
the fillna method is the workhorse function to use. Calling fillna with a
constant replaces missing values with that value
2. value: scalar value or dict-like object to use to fill missing values
3. method: interpolation method such as 'ffill' or 'bfill'; either value or method must be given (method is deprecated since pandas 2.1 in favor of the ffill and bfill methods)
4. axis: axis to fill on; the default is 0
5. inplace: modify the calling object without producing a copy
6. limit: for forward and backward filling, the maximum number of consecutive periods to fill

In [143]:
df.fillna(0)

Unnamed: 0,0,1,2
0,0.59489,0.0,0.0
1,-1.750481,0.0,0.0
2,0.057361,0.0,0.7484
3,1.66155,0.0,0.451061
4,-0.385423,1.660343,0.647623
5,1.315059,0.290741,-0.903033
6,1.372561,0.665482,0.310616


In [144]:
"""Calling fillna with a dict, you can use a different fill value for each column:
fillna by default returns a new object, but you can modify the existing object in place with inplace=True"""
df.fillna({1: 11, 2:0})

Unnamed: 0,0,1,2
0,0.59489,11.0,0.0
1,-1.750481,11.0,0.0
2,0.057361,11.0,0.7484
3,1.66155,11.0,0.451061
4,-0.385423,1.660343,0.647623
5,1.315059,0.290741,-0.903033
6,1.372561,0.665482,0.310616


In [146]:
"""The same interpolation methods available for reindexing can be used with fillna"""
df = pd.DataFrame(np.random.randn(6, 3))
df.iloc[2:, 1] = np.nan
df.iloc[4:, 2] = np.nan
df

Unnamed: 0,0,1,2
0,-0.066665,1.157913,-0.697975
1,0.292313,-0.389585,-0.16415
2,-0.19105,,1.314332
3,1.317004,,-1.300259
4,-0.341981,,
5,0.628667,,


In [147]:
# equivalent to df.fillna(method='ffill'), which is deprecated since pandas 2.1
df.ffill()

Unnamed: 0,0,1,2
0,-0.066665,1.157913,-0.697975
1,0.292313,-0.389585,-0.16415
2,-0.19105,-0.389585,1.314332
3,1.317004,-0.389585,-1.300259
4,-0.341981,-0.389585,-1.300259
5,0.628667,-0.389585,-1.300259


In [148]:
df.ffill(limit=2)  # fill at most 2 consecutive missing values

Unnamed: 0,0,1,2
0,-0.066665,1.157913,-0.697975
1,0.292313,-0.389585,-0.16415
2,-0.19105,-0.389585,1.314332
3,1.317004,-0.389585,-1.300259
4,-0.341981,,-1.300259
5,0.628667,,-1.300259
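
Besides constants and forward filling, a common pattern is to fill each column with a statistic of that column. The following sketch uses made-up values and passes the column means to `fillna`; passing a Series fills each column by its matching label.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                   "b": [np.nan, 4.0, np.nan]})

# df.mean() is a Series indexed by column name; NaN is skipped when averaging.
by_mean = df.fillna(df.mean())

# Forward fill, but at most one consecutive missing value per gap.
forward = df.ffill(limit=1)
```

Note that forward filling cannot fill a gap at the very start of a column, since there is no earlier value to carry forward.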


## Hierarchical Indexing
1. Hierarchical indexing is an important feature of pandas that enables you to have multiple
(two or more) index levels on an axis. Somewhat abstractly, it provides a way for
you to work with higher dimensional data in a lower dimensional form.

In [152]:
data = pd.Series(np.random.randn(9), index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                                              [1, 2, 3, 1, 3, 1 ,2 ,2 ,3]])
data

a  1    1.403043
   2    0.016535
   3   -0.199138
b  1    0.153589
   3    0.584001
c  1   -1.545812
   2   -0.137538
d  2    0.020457
   3   -0.395125
dtype: float64

In [163]:
data.index
"""With a hierarchically indexed object, so-called partial indexing is possible, enabling
you to concisely select subsets of the data"""
data['b']
"""Selection is even possible from an “inner” level:"""
data[:, 1]

a    1.403043
b    0.153589
c   -1.545812
dtype: float64
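
Partial indexing can also be written with `.loc` and `xs`, which makes the level being selected explicit. A small sketch with fixed values (`demo` is a made-up stand-in for the random `data` above, using the same index):

```python
import numpy as np
import pandas as pd

# Same two-level index as above, but with known values 0.0 .. 8.0
demo = pd.Series(np.arange(9.0),
                 index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 3, 1, 2, 2, 3]])

outer_b = demo['b']                 # everything under outer label 'b'
outer_range = demo.loc['b':'c']     # label slice on the outer level
inner_1 = demo.loc[:, 1]            # inner level equal to 1
inner_1_xs = demo.xs(1, level=1)    # same selection, written with xs
```

Both `inner_1` and `inner_1_xs` keep only the outer labels that actually have an inner label 1, which is why 'd' does not appear.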

In [165]:
"""Hierarchical indexing plays an important role in reshaping data and group-based
operations like forming a pivot table. For example, you could rearrange the data into
a DataFrame using its unstack method"""
data.unstack()
# inverse method is stack()
data.unstack().stack()

a  1    1.403043
   2    0.016535
   3   -0.199138
b  1    0.153589
   3    0.584001
c  1   -1.545812
   2   -0.137538
d  2    0.020457
   3   -0.395125
dtype: float64
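
A sketch of the round trip with fixed values, to show where the missing entries come from. `demo` is a made-up Series with the same index as `data`. The explicit `dropna()` is a precaution: as far as I know, whether `stack` discards missing entries on its own depends on the pandas version.

```python
import numpy as np
import pandas as pd

demo = pd.Series(np.arange(9.0),
                 index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 3, 1, 2, 2, 3]])

# Inner level becomes the columns; absent (outer, inner) pairs become NaN
wide = demo.unstack()

# stack moves the columns back into the index; dropna removes the NaN
# placeholders so only the original nine entries remain
round_trip = wide.stack().dropna()
```

`wide` has 4 rows and 3 columns, with NaN at the three combinations ('b', 2), ('c', 3) and ('d', 1) that are not present in `demo`.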

In [169]:
"""With a DataFrame, either axis can have a hierarchical index:"""
frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']])
frame
"""The hierarchical levels can have names"""
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']
frame

,state,Ohio,Ohio,Colorado
,color,Green,Red,Green
key1,key2,,,
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [172]:
frame.Ohio

NameError: name 'MultiIndex' is not defined
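
The NameError above is what Python raises when `MultiIndex` is referenced without the `pd.` prefix. The statement that triggered it is not shown, so that cause is a guess. The sketch below builds the index objects explicitly with `pd.MultiIndex.from_arrays` and then selects the 'Ohio' column group; `frame_demo` is a made-up name that mirrors `frame`.

```python
import numpy as np
import pandas as pd

# A MultiIndex can be created on its own and reused
columns = pd.MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'],
                                     ['Green', 'Red', 'Green']],
                                    names=['state', 'color'])
index = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                                  names=['key1', 'key2'])

frame_demo = pd.DataFrame(np.arange(12).reshape((4, 3)),
                          index=index, columns=columns)

# Partial indexing on the outer column level returns a DataFrame
# whose columns are the remaining 'color' level
ohio = frame_demo['Ohio']
```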

## Reordering and Sorting levels
1. At times you will need to rearrange the order of the levels on an axis or sort the data
by the values in one specific level. The swaplevel method takes two level numbers or names
and returns a new object with the levels interchanged (the data is otherwise
unaltered).

In [173]:
frame.swaplevel('key1', 'key2')

,state,Ohio,Ohio,Colorado
,color,Green,Red,Green
key2,key1,,,
1,a,0,1,2
2,a,3,4,5
1,b,6,7,8
2,b,9,10,11


In [176]:
"""sort_index, on the other hand, sorts the data using only the values in a single level.
When swapping levels, it’s not uncommon to also use sort_index so that the result is
lexicographically sorted by the indicated level:"""
frame.sort_index(level=1)

,state,Ohio,Ohio,Colorado
,color,Green,Red,Green
key1,key2,,,
a,1,0,1,2
b,1,6,7,8
a,2,3,4,5
b,2,9,10,11


## Summary Statistics by Level
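1. Many descriptive and summary statistics on DataFrame and Series can be computed
per index level on a particular axis. Older pandas versions exposed this through a
level option (for example frame.sum(level=1)); that option was removed in pandas 2.0,
so group by the level and then aggregate.

In [177]:
frame.groupby(level=1).sum()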

state,Ohio,Ohio,Colorado
color,Green,Red,Green
key2,,,
1,6,8,10
2,12,14,16


In [179]:
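# Column-wise equivalent of the older frame.sum(level='color', axis=1):
# transpose, group the (now row) level 'color', aggregate, transpose back
frame.T.groupby(level='color').sum().T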

,color,Green,Red
key1,key2,,
a,1,2,1
a,2,8,4
b,1,14,7
b,2,20,10
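
Aggregating by level also works on a Series and with statistics other than the sum. A minimal sketch using `groupby(level=...)` on made-up data (`demo` and its values are assumptions for illustration):

```python
import pandas as pd

index = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                                  names=['key1', 'key2'])
demo = pd.Series([1.0, 2.0, 3.0, 4.0], index=index)

# Sum within each outer label: a -> 1 + 2, b -> 3 + 4
by_key1 = demo.groupby(level='key1').sum()

# Mean within each inner label: 1 -> (1 + 3) / 2, 2 -> (2 + 4) / 2
by_key2 = demo.groupby(level='key2').mean()
```

Levels can be given by name or by position, so `demo.groupby(level=0).sum()` selects the same grouping as `level='key1'`.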


## Indexing with a DataFrame's columns
1. It’s not unusual to want to use one or more columns from a DataFrame as the row
index; alternatively, you may wish to move the row index into the DataFrame’s
columns.

In [180]:
frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
                      'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                      'd': [0, 1, 2, 0, 1, 2, 3]})
frame

Unnamed: 0,a,b,c,d
0,0,7,one,0
1,1,6,one,1
2,2,5,one,2
3,3,4,two,0
4,4,3,two,1
5,5,2,two,2
6,6,1,two,3


In [185]:
"""DataFrame’s set_index function will create a new DataFrame using one or more of
its columns as the index"""
frame2 = frame.set_index(['c', 'd'])

In [183]:
"""By default the columns are removed from the DataFrame, though you can keep them
by passing drop=False"""
frame.set_index(['c', 'd'], drop=False)

,,a,b,c,d
c,d,,,,
one,0,0,7,one,0
one,1,1,6,one,1
one,2,2,5,one,2
two,0,3,4,two,0
two,1,4,3,two,1
two,2,5,2,two,2
two,3,6,1,two,3


In [186]:
"""reset_index, on the other hand, does the opposite of set_index; the hierarchical
index levels are moved into the columns"""
frame2.reset_index()

Unnamed: 0,c,d,a,b
0,one,0,0,7
1,one,1,1,6
2,one,2,2,5
3,two,0,3,4
4,two,1,4,3
5,two,2,5,2
6,two,3,6,1
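
The round trip can be checked directly. In this sketch `frame_demo` is a made-up copy of the frame above; note that `reset_index` puts the former index levels first, so the column order differs from the original.

```python
import pandas as pd

frame_demo = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
                           'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                           'd': [0, 1, 2, 0, 1, 2, 3]})

# Columns c and d move into a two-level row index
indexed = frame_demo.set_index(['c', 'd'])

# The index levels move back into ordinary columns, placed first
restored = indexed.reset_index()

# Reordering the columns recovers the original layout
same_order = restored[['a', 'b', 'c', 'd']]
```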
