# Financial and Economic Data Applications 
The use of Python in the financial industry has been increasing rapidly since 2005, led largely by the maturation of 
libraries and the availability of skilled Python programmers. Institutions have found that Python is well suited both 
as an interactive analysis environment and as a language for building robust systems, often in a fraction of the time it would have taken in Java or C++. 

# Data Munging Topics

In [2]:
# Time Series and Cross-Section Alignment 
# One of the most time-consuming issues in working with financial data is the so-called data alignment problem. 
# Two related time series may have indexes that don't line up perfectly or two DataFrame objects might have columns or row 
# labels that don't match. 
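
A minimal sketch of the alignment problem, assuming two small made-up Series (the names `prices` and `volume` and their values are illustrative, not data from this notebook). pandas aligns on index labels automatically in arithmetic, and `align` makes the same step explicit:

```python
import pandas as pd

# Two hypothetical series whose date indexes only partly overlap.
prices = pd.Series([10.0, 11.0, 12.0],
                   index=pd.to_datetime(['2012-06-12', '2012-06-13', '2012-06-14']))
volume = pd.Series([100.0, 200.0],
                   index=pd.to_datetime(['2012-06-13', '2012-06-15']))

# Arithmetic aligns on the union of the two indexes; labels present in
# only one operand produce NaN.
dollar_volume = prices * volume

# align() returns both series conformed to a common index.
# join='inner' keeps only the dates present in both.
p_aligned, v_aligned = prices.align(volume, join='inner')

print(dollar_volume)
print(p_aligned.index.equals(v_aligned.index))
```

Only 2012-06-13 appears in both inputs, so it is the only date with a non-missing product.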

# Operations with Time Series of Different Frequencies
Economic time series are often of annual, quarterly, or monthly frequency, and it is often necessary to perform operations that combine series of different frequencies.
Some are completely irregular; for example, earnings revisions for a stock may arrive at any time. 
The two main tools for frequency conversion and realignment are the 'resample' and 'reindex' methods. 
'resample' converts data to a fixed frequency, while 'reindex' conforms data to a new index. Both support optional interpolation logic, such as forward-filling. 

In [3]:
# Consider a small weekly time series. 
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
from datetime import datetime
from datetime import timedelta
import matplotlib.pyplot as plt

# Create a small time series with weekly (Wednesday) frequency. 
# Random values stand in for real market data such as prices or volumes.

ts1 = Series(
    np.random.randn(3),
    index=pd.date_range('2012-6-13', periods=3, freq='W-WED'),
)

In [4]:
ts1

2012-06-13   -0.407684
2012-06-20   -0.501774
2012-06-27    0.737268
Freq: W-WED, dtype: float64

In [5]:
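# If you resample this to business daily (Monday-Friday) frequency, you get holes on the days where there is no data. 
# Note that resample by itself only returns a lazy Resampler object. Call .asfreq() on it to see the holes as NaN, 
# or a fill method such as .ffill() to fill them. 
ts1.resample('B') 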

<pandas.core.resample.DatetimeIndexResampler object at 0x171e239d0>

In [7]:
print(ts1.resample('B')) 

DatetimeIndexResampler [freq=<BusinessDay>, axis=0, closed=left, label=left, convention=start, origin=start_day]


In [10]:
ts1.resample('B').ffill()

2012-06-13   -0.407684
2012-06-14   -0.407684
2012-06-15   -0.407684
2012-06-18   -0.407684
2012-06-19   -0.407684
2012-06-20   -0.501774
2012-06-21   -0.501774
2012-06-22   -0.501774
2012-06-25   -0.501774
2012-06-26   -0.501774
2012-06-27    0.737268
Freq: B, dtype: float64
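
To make the difference between leaving holes and filling them concrete, here is a small sketch with fixed, made-up values (the name `weekly` is an assumption used only for this illustration). It compares `asfreq`, an unlimited forward fill, and a forward fill capped with `limit`:

```python
import pandas as pd

# Weekly (Wednesday) series with fixed, made-up values.
weekly = pd.Series([1.0, 2.0, 3.0],
                   index=pd.date_range('2012-06-13', periods=3, freq='W-WED'))

# asfreq() materializes the business-day grid and leaves the gaps as NaN.
holes = weekly.resample('B').asfreq()

# ffill() carries each weekly value forward; limit caps how many
# consecutive business days are filled after each observation.
filled = weekly.resample('B').ffill()
filled_limited = weekly.resample('B').ffill(limit=2)

print(holes.isna().sum(), filled.isna().sum(), filled_limited.isna().sum())
```

Capping the fill can be useful when a stale value should not be carried forward indefinitely.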

In [11]:
# In practice, upsampling lower frequency data to a higher, regular frequency is a fine 
# solution, but in the more general irregular time series case it may be a poor fit. 
# Consider an irregularly sampled time series from the same general time period:

# Build a DatetimeIndex so the labels are real timestamps rather than strings.
dates = pd.DatetimeIndex(['2012-6-12', '2012-6-17', '2012-6-18', '2012-6-21', '2012-6-22']) 

In [13]:
ts2 = pd.Series(np.random.rand(5), index=dates)
ts2

2012-06-12    0.124310
2012-06-17    0.090882
2012-06-18    0.960148
2012-06-21    0.640752
2012-06-22    0.694223
dtype: float64
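
One common way to combine a regular series with an irregular one is to reindex the regular series onto the irregular timestamps with forward-filling. The sketch below assumes fixed stand-in values instead of random data, so the result is reproducible:

```python
import pandas as pd

# Weekly series (fixed values stand in for random data).
ts1 = pd.Series([1.0, 2.0, 3.0],
                index=pd.date_range('2012-06-13', periods=3, freq='W-WED'))

# Irregularly spaced observations over the same period.
dates = pd.DatetimeIndex(['2012-06-12', '2012-06-17', '2012-06-18',
                          '2012-06-21', '2012-06-22'])
ts2 = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0], index=dates)

# Conform ts1 to ts2's index, carrying the last weekly value forward.
# 2012-06-12 precedes the first ts1 observation, so it stays NaN.
ts1_on_ts2 = ts1.reindex(ts2.index, method='ffill')

# With matching indexes, arithmetic lines up row by row.
combined = ts2 + ts1_on_ts2
print(combined)
```

This keeps the irregular index of `ts2` intact instead of forcing both series onto a regular grid.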

# Using Periods Instead of Timestamps 

In [14]:
# Periods (representing time spans) provide an alternate means of working with time series of different frequencies, 
# especially financial or economic series with annual or quarterly frequency having a particular reporting convention. 
# For example, a company might announce its quarterly earnings with a fiscal year ending in June, thus having Q-JUN frequency. 
# Consider a pair of macroeconomic time series related to GDP and inflation. 

gdp = Series( [1.78, 1.94, 2.08, 2.01, 2.15], index = pd.period_range('1984Q2', periods = 5, freq = 'Q-DEC') ) 
inflation = Series( [0.025, 0.045, 0.037, 0.04], index = pd.period_range('1982', periods = 4, freq = 'A-DEC') ) 

In [15]:
gdp 

1984Q2    1.78
1984Q3    1.94
1984Q4    2.08
1985Q1    2.01
1985Q2    2.15
Freq: Q-DEC, dtype: float64

In [16]:
inflation

1982    0.025
1983    0.045
1984    0.037
1985    0.040
Freq: A-DEC, dtype: float64

In [17]:
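# Periods of different frequencies must be converted explicitly. Here each annual value 
# is placed on the last quarter of its year (how = 'end'). 
inf_q = inflation.asfreq('Q-DEC', how = 'end') 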

In [18]:
inf_q 

1982Q4    0.025
1983Q4    0.045
1984Q4    0.037
1985Q4    0.040
Freq: Q-DEC, dtype: float64

In [19]:
# That time series can then be reindexed with forward-filling to match gdp: 
inf_q.reindex(gdp.index, method = 'ffill') 

1984Q2    0.045
1984Q3    0.045
1984Q4    0.037
1985Q1    0.037
1985Q2    0.037
Freq: Q-DEC, dtype: float64
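
As a small illustration of the fiscal-year convention mentioned above, the sketch below shows which calendar days a Q-JUN quarter covers. The specific period `2012Q4` is an arbitrary example, not taken from the data in this notebook:

```python
import pandas as pd

# Fourth quarter of a fiscal year that ends in June 2012.
p = pd.Period('2012Q4', freq='Q-JUN')

# Convert to daily periods to see the calendar span of the quarter.
start = p.asfreq('D', how='start')
end = p.asfreq('D', how='end')
print(start, end)

# Moving one quarter forward crosses into the next fiscal year.
next_quarter = p + 1
print(next_quarter)
```

With a June year end, the fourth fiscal quarter runs from April through June, so the quarter label depends on the reporting convention as well as on the calendar date.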

# Time of Day and "As Of" Data Selection 

In [20]:
# Suppose you have a long time series containing intraday market data and you want to extract the prices at a particular time of day 
# on each day of the data. 
# What if the data are irregular, such that observations do not fall exactly on the desired time? 
# In practice this task can make for error-prone data munging if you are not careful. 
# Here is a small sample of such data.
# Make an intraday date range and time series 

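# 'min' is the minutely frequency alias (older pandas versions also accept 'T').
rng = pd.date_range('2012-06-01 09:30', '2012-06-01 15:59', freq = 'min')  

In [23]:
# Append the same intraday range for each of the next three calendar days.
rng = rng.append([rng + pd.DateOffset(days=i) for i in range(1, 4)]).astype('datetime64[ns]')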


In [24]:
ts = Series(np.arange(len(rng), dtype = float), index = rng) 

In [25]:
ts 

2012-06-01 09:30:00       0.0
2012-06-01 09:31:00       1.0
2012-06-01 09:32:00       2.0
2012-06-01 09:33:00       3.0
2012-06-01 09:34:00       4.0
                        ...  
2012-06-04 15:55:00    1555.0
2012-06-04 15:56:00    1556.0
2012-06-04 15:57:00    1557.0
2012-06-04 15:58:00    1558.0
2012-06-04 15:59:00    1559.0
Length: 1560, dtype: float64

In [26]:
# Indexing with a Python datetime.time object will extract values at those times 
from datetime import time
ts[time(10, 0)] 

2012-06-01 10:00:00      30.0
2012-06-02 10:00:00     420.0
2012-06-03 10:00:00     810.0
2012-06-04 10:00:00    1200.0
dtype: float64

In [27]:
# Under the hood, this uses an instance method at_time (available on individual time series and DataFrame objects alike): 
ts.at_time(time(10, 0))

2012-06-01 10:00:00      30.0
2012-06-02 10:00:00     420.0
2012-06-03 10:00:00     810.0
2012-06-04 10:00:00    1200.0
dtype: float64

In [28]:
# You can select values between two times using the related between_time method: 
ts.between_time(time(10,0), time(10, 1)) 

2012-06-01 10:00:00      30.0
2012-06-01 10:01:00      31.0
2012-06-02 10:00:00     420.0
2012-06-02 10:01:00     421.0
2012-06-03 10:00:00     810.0
2012-06-03 10:01:00     811.0
2012-06-04 10:00:00    1200.0
2012-06-04 10:01:00    1201.0
dtype: float64

In [29]:
# As mentioned above, it might be the case that no data actually fall exactly at a time like 10 AM, but you might want to know the last known value at 10 AM. 
# Set most of the time series randomly to NA values: 

indexer = np.sort(np.random.permutation(len(ts))[700:]) 

In [30]:
irr_ts = ts.copy() 

In [31]:
# indexer holds integer positions, so assign by position with iloc.
irr_ts.iloc[indexer] = np.nan

In [32]:
irr_ts['2012-06-01 09:50':'2012-06-01 10:00'] 

2012-06-01 09:50:00    20.0
2012-06-01 09:51:00     NaN
2012-06-01 09:52:00     NaN
2012-06-01 09:53:00    23.0
2012-06-01 09:54:00    24.0
2012-06-01 09:55:00     NaN
2012-06-01 09:56:00     NaN
2012-06-01 09:57:00     NaN
2012-06-01 09:58:00     NaN
2012-06-01 09:59:00    29.0
2012-06-01 10:00:00    30.0
dtype: float64

In [33]:
# By passing an array of timestamps to the "asof" method, you will obtain an array of the last valid (non-NaN) 
# values at or before each timestamp. Here we construct timestamps at 10 AM on four consecutive business days 
# starting 2012-06-01. The last two fall after the end of the series, so asof returns the final valid value for them: 

selection = pd.date_range('2012-06-01 10:00', periods = 4, freq = 'B') 

In [34]:
irr_ts.asof(selection) 

2012-06-01 10:00:00      30.0
2012-06-04 10:00:00    1200.0
2012-06-05 10:00:00    1558.0
2012-06-06 10:00:00    1558.0
Freq: B, dtype: float64
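
The contrast between exact-time selection and "as of" selection is easier to see on a tiny series. The sketch below assumes a handful of made-up quotes (the name `quotes` and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical irregular quotes: no observation falls exactly on 10:00,
# and one observation is missing (NaN).
quotes = pd.Series(
    [100.0, np.nan, 101.5, 102.0],
    index=pd.to_datetime(['2012-06-01 09:58', '2012-06-01 09:59',
                          '2012-06-01 10:02', '2012-06-01 10:05']),
)

# at_time needs an exact match, so it finds nothing here.
exact = quotes.at_time('10:00')

# asof returns the last non-NaN value at or before each requested time.
targets = pd.to_datetime(['2012-06-01 10:00', '2012-06-01 10:03'])
last_known = quotes.asof(targets)

print(len(exact))
print(last_known)
```

Note that the NaN at 09:59 is skipped, so the value reported for 10:00 comes from 09:58.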

# Splicing Together Data Sources

In [35]:
# Chapter 7 described a number of strategies for merging together two related data sets. 
# In a financial or economic context, there are a few widely occurring use cases for this type of operation: 
#  1. Switching from one data source (a time series or collection of time series) to another at a specific point in time. 
#  2. Patching missing values at the beginning, middle, or end of a time series using another time series. 
#  3. Completely replacing the data for a subset of symbols (countries, asset tickers, and so on). 

In [36]:
# In the first case, switching from one set of time series to another at a specific instant, it is a matter of splicing together two Series or DataFrame objects using pandas.concat. 
# Here is a small example of such an operation. 

data1 = DataFrame(np.ones((6, 3), dtype = float), columns = ['a', 'b', 'c'], index = pd.date_range('6/12/2012', periods = 6)) 

In [37]:
data2 = DataFrame(np.ones((6, 3), dtype = float) * 2, columns = ['a', 'b', 'c'], index = pd.date_range('6/13/2012', periods = 6))

In [39]:
spliced = pd.concat([data1.loc[:'2012-06-14'], data2.loc['2012-06-15':]]) 

In [40]:
spliced

              a    b    c
2012-06-12  1.0  1.0  1.0
2012-06-13  1.0  1.0  1.0
2012-06-14  1.0  1.0  1.0
2012-06-15  2.0  2.0  2.0
2012-06-16  2.0  2.0  2.0
2012-06-17  2.0  2.0  2.0
2012-06-18  2.0  2.0  2.0
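
The splice can also be wrapped in a small function. The helper below is a hypothetical sketch (`splice` is not a pandas function); it uses boolean masks instead of label slices so that the switch date belongs to exactly one of the two sources:

```python
import numpy as np
import pandas as pd

def splice(old, new, switch_date):
    """Use rows from `old` strictly before switch_date and rows from
    `new` on or after it. (Hypothetical helper, not a pandas function.)"""
    switch = pd.Timestamp(switch_date)
    return pd.concat([old[old.index < switch], new[new.index >= switch]])

old = pd.DataFrame(np.ones((6, 3)), columns=['a', 'b', 'c'],
                   index=pd.date_range('2012-06-12', periods=6))
new = pd.DataFrame(np.ones((6, 3)) * 2, columns=['a', 'b', 'c'],
                   index=pd.date_range('2012-06-13', periods=6))

result = splice(old, new, '2012-06-15')
print(result)
```

Because the two masks are mutually exclusive, the result cannot contain duplicate dates even when the sources overlap.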


In [41]:
# Suppose in a similar example that data1 was missing a time series present in data2. 

data2 = DataFrame(np.ones((6, 4), dtype = float) * 2, columns = ['a', 'b', 'c', 'd'], index = pd.date_range('6/13/2012', periods = 6)) 

In [42]:
sliced = pd.concat([data1.loc[:'2012-06-14'], data2.loc['2012-06-15':]]) 

In [43]:
sliced 

              a    b    c    d
2012-06-12  1.0  1.0  1.0  NaN
2012-06-13  1.0  1.0  1.0  NaN
2012-06-14  1.0  1.0  1.0  NaN
2012-06-15  2.0  2.0  2.0  2.0
2012-06-16  2.0  2.0  2.0  2.0
2012-06-17  2.0  2.0  2.0  2.0
2012-06-18  2.0  2.0  2.0  2.0


In [44]:
# Using combine_first, you can bring in data from before the splice point to extend the history for the 'd' item: 

sliced_filled = sliced.combine_first(data2) 

In [45]:
sliced_filled

              a    b    c    d
2012-06-12  1.0  1.0  1.0  NaN
2012-06-13  1.0  1.0  1.0  2.0
2012-06-14  1.0  1.0  1.0  2.0
2012-06-15  2.0  2.0  2.0  2.0
2012-06-16  2.0  2.0  2.0  2.0
2012-06-17  2.0  2.0  2.0  2.0
2012-06-18  2.0  2.0  2.0  2.0


In [46]:
# Since data2 does not have any values for 2012-06-12, no values are filled on that day. 
# DataFrame has a related method, "update", for performing in-place updates. You have to pass "overwrite=False" 
# to make it only fill the holes. Here it is applied to "sliced", the frame that still has holes in column 'd': 
sliced.update(data2, overwrite = False) 

In [47]:
sliced

              a    b    c    d
2012-06-12  1.0  1.0  1.0  NaN
2012-06-13  1.0  1.0  1.0  2.0
2012-06-14  1.0  1.0  1.0  2.0
2012-06-15  2.0  2.0  2.0  2.0
2012-06-16  2.0  2.0  2.0  2.0
2012-06-17  2.0  2.0  2.0  2.0
2012-06-18  2.0  2.0  2.0  2.0


In [48]:
# To replace the data for a subset of symbols, you can use any of the above techniques, 
# but sometimes it's simpler to just set the columns directly with DataFrame indexing. 
# Here is an example of that: 

cp_spliced = spliced.copy()

In [49]:
cp_spliced[['a', 'c']] = data1[['a', 'c']]

In [50]:
cp_spliced

              a    b    c
2012-06-12  1.0  1.0  1.0
2012-06-13  1.0  1.0  1.0
2012-06-14  1.0  1.0  1.0
2012-06-15  1.0  2.0  1.0
2012-06-16  1.0  2.0  1.0
2012-06-17  1.0  2.0  1.0
2012-06-18  NaN  2.0  NaN
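
The missing values in the last row come from index alignment: column assignment matches rows by label, and the source frame has no row for the final date. A small sketch with made-up frames (`target` and `source` are hypothetical names) shows the effect and one possible way to keep the existing values where the source has none:

```python
import pandas as pd

target = pd.DataFrame({'a': [2.0, 2.0, 2.0], 'b': [2.0, 2.0, 2.0]},
                      index=pd.date_range('2012-06-16', periods=3))
source = pd.DataFrame({'a': [1.0, 1.0]},
                      index=pd.date_range('2012-06-16', periods=2))

# Direct assignment aligns on the index: the date missing from `source`
# becomes NaN in the target column.
replaced = target.copy()
replaced['a'] = source['a']

# One possible alternative: prefer `source`, fall back to the old values.
patched = target.copy()
patched['a'] = source['a'].combine_first(target['a'])

print(replaced)
print(patched)
```

Whether a NaN or the previous value is the right outcome depends on whether the old source is still considered valid for the uncovered dates.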


# Return Indexes and Cumulative Returns 
In a financial context, returns usually refer to percent changes in the price of an asset. 

In [55]:
# %pip install yfinance

In [58]:
import yfinance as yf
price = yf.download('AAPL', start = '2014-01-01', end = '2015-01-01') 

[*********************100%%**********************]  1 of 1 completed


In [59]:
price

                 Open       High        Low      Close  Adj Close     Volume
Date
2014-01-02  19.845715  19.893929  19.715000  19.754642  17.296659  234684800
2014-01-03  19.745001  19.775000  19.301071  19.320715  16.916719  392467600
2014-01-06  19.194643  19.528570  19.057142  19.426071  17.008974  412610800
2014-01-07  19.440001  19.498571  19.211430  19.287144  16.887325  317209200
2014-01-08  19.243214  19.484285  19.238930  19.409286  16.994270  258529600
...               ...        ...        ...        ...        ...        ...
2014-12-24  28.145000  28.177500  28.002501  28.002501  25.034250   57918400
2014-12-26  28.025000  28.629999  28.002501  28.497499  25.476780  134884000
2014-12-29  28.447500  28.692499  28.424999  28.477501  25.458895  110395600
2014-12-30  28.410000  28.480000  28.027500  28.129999  25.148235  119526000


In [60]:
price1 = price[['Adj Close']]

In [61]:
price1[-5:]

            Adj Close
Date
2014-12-24  25.034250
2014-12-26  25.476780
2014-12-29  25.458895
2014-12-30  25.148235
2014-12-31  24.669947


In [62]:
# Because the adjusted close already accounts for splits and dividends, computing the cumulative percent return between two points in time requires only the percent change in the adjusted price. 
returns = price1.pct_change() 

In [63]:
returns 

            Adj Close
Date
2014-01-02        NaN
2014-01-03  -0.021966
2014-01-06   0.005453
2014-01-07  -0.007152
2014-01-08   0.006333
...               ...
2014-12-24  -0.004709
2014-12-26   0.017677
2014-12-29  -0.000702
2014-12-30  -0.012202


In [64]:
# For other stocks with dividend payouts, computing how much money you make from holding a stock can be more complicated. 
# The adjusted close values used here have been adjusted for splits and dividends.
# In all cases, however, it's quite common to derive a "return index", which is a time series indicating the value of a unit investment (one dollar, say). 
# Many assumptions can underlie the return index; for example, some will choose to reinvest profit and others not. 
# In the case of Apple, we can compute a simple return index from the adjusted close using "cumprod": 

returns = price1.pct_change() 

In [65]:
ret_index = (1 + returns).cumprod() 

In [66]:
ret_index.iloc[0] = 1 # set the first row (NaN after pct_change) to 1 

In [68]:
ret_index.iloc[0]

Adj Close    1.0
Name: 2014-01-02 00:00:00, dtype: float64

In [69]:
# With a return index in hand, computing cumulative returns at a particular resolution is simple. 
# For example, to compute monthly returns, take the last value of each business month and then the percent change. 
# ('BM' is spelled 'BME' in pandas 2.2 and later.) 
m_returns = ret_index.resample('BM').last().pct_change() 

In [70]:
m_returns.loc['2014']

            Adj Close
Date
2014-01-31        NaN
2014-02-28   0.057511
2014-03-31   0.019953
2014-04-30   0.099397
2014-05-30   0.078709
2014-06-30   0.027662
2014-07-31   0.028732
2014-08-29   0.077508
2014-09-30  -0.017073
2014-10-31   0.071960
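
To check the logic of a return index independently of downloaded data, here is a minimal sketch on made-up prices. The values are illustrative only, and grouping by calendar month with `to_period('M')` is swapped in for the business-month resample because the resample alias differs across pandas versions:

```python
import pandas as pd

# Made-up adjusted prices on a few business days (illustrative only).
prices = pd.Series(
    [100.0, 102.0, 101.0, 105.0, 110.0],
    index=pd.to_datetime(['2014-01-30', '2014-01-31', '2014-02-03',
                          '2014-02-27', '2014-02-28']),
)

returns = prices.pct_change()
ret_index = (1 + returns).cumprod()
ret_index.iloc[0] = 1.0  # the first return is NaN; start the index at 1

# The return index is just the price rescaled to start at 1.
rescaled = prices / prices.iloc[0]

# Monthly returns: last index value in each calendar month, then pct_change.
month_end = ret_index.groupby(ret_index.index.to_period('M')).last()
monthly = month_end.pct_change()
print(monthly)
```

The February return here equals the ratio of the two month-end prices minus one, which is what the return index is meant to make easy to compute.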


# Group Transforms and Analysis 

In [71]:
# Let's consider a collection of hypothetical stock portfolios. Randomly generate a broad universe of 1,000 tickers. 

import random; random.seed(0)
import string
N = 1000 
def rands(n):
    choices = string.ascii_uppercase
    return ''.join([random.choice(choices) for _ in range(n)])
tickers = np.array([rands(5) for _ in range(N)]) 

In [72]:
# Then create a DataFrame with 3 columns representing hypothetical, randomly generated portfolios for a subset of the tickers. 
M = 500
df = DataFrame({
    'Momentum': np.random.randn(M) / 200 + 0.03,
    'Value': np.random.randn(M) / 200 + 0.08,
    'ShortInterest': np.random.randn(M) / 200 - 0.02
}, index = tickers[:M]) 