## Market Data: Preprocessing Development/Testing Scripts

There are thousands of files, each covering a different span of time and range of values. The ultimate goal is to know whether the market is moving in a statistically significant positive or negative direction. To make that determination, each relevant entity must be known to be moving in a significant direction. A stock like Apple might move hundreds of dollars in a month without that being significant, while the same move for a start-up company might be quite significant. To minimize the impact of individual stocks or companies having a significant price movement, only the ETF files are processed, as they group similar categories of companies together and reduce the observed market volatility.

This file develops the import and processing of a single ETF file. The attributes that need to be retained are the ETF name, the year, and whether the price made a significant positive or negative movement. This process will be placed in a loop in another script to process all the ETFs.

As this script demonstrates the processing applied to the ETF data, print statements are included; they are unnecessary during the actual processing.

In [309]:
import pandas as pd
import os 

Obtain the file path and extract the ETF name

In [310]:
# assigns the relative file path. This will be an object instead of a named file path in the looped script
file = "Data/markets_test/aadr.us.txt"
#takes the string after the last '/'
etf_name = os.path.basename(os.path.normpath(file))
#removes the .txt of the file string
etf_name = etf_name.replace('.txt', '')
print(etf_name)

aadr.us
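
For the looped script, the same name extraction could be wrapped in a small helper. The sketch below is one possible way to do it; `etf_name_from_path` is a hypothetical name, and it uses `os.path.splitext` instead of `str.replace` so that only the final extension is removed.

```python
import os

def etf_name_from_path(path):
    """Return the file name without its final extension (hypothetical helper)."""
    # basename keeps the part after the last path separator
    base = os.path.basename(os.path.normpath(path))
    # splitext splits off only the last extension, here '.txt'
    name, _ext = os.path.splitext(base)
    return name

print(etf_name_from_path("Data/markets_test/aadr.us.txt"))
```

Because only the trailing extension is stripped, a ticker that happens to contain the text `.txt` elsewhere in its name would not be altered.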


### Familiarize with data

Not needed during the looping script.

In [311]:
test_etf = pd.read_csv(file)

In [312]:
"Min", test_etf.min(), "Max", test_etf.max()

('Min',
 Date       2010-07-21
 Open           23.936
 High           23.946
 Low            23.867
 Close          23.946
 Volume              2
 OpenInt             0
 dtype: object,
 'Max',
 Date       2017-11-10
 Open            58.62
 High            58.72
 Low              57.7
 Close           58.43
 Volume         106139
 OpenInt             0
 dtype: object)

In [313]:
test_etf.Date.info()

<class 'pandas.core.series.Series'>
RangeIndex: 1565 entries, 0 to 1564
Series name: Date
Non-Null Count  Dtype 
--------------  ----- 
1565 non-null   object
dtypes: object(1)
memory usage: 12.4+ KB


In [314]:
test_etf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1565 entries, 0 to 1564
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Date     1565 non-null   object 
 1   Open     1565 non-null   float64
 2   High     1565 non-null   float64
 3   Low      1565 non-null   float64
 4   Close    1565 non-null   float64
 5   Volume   1565 non-null   int64  
 6   OpenInt  1565 non-null   int64  
dtypes: float64(4), int64(2), object(1)
memory usage: 85.7+ KB


#### Drop unneeded columns: OpenInt, Volume, Low, High

In [315]:
drop_columns = ["OpenInt", "Volume", "Low", "High"]
# drop in place; with inplace=True, drop() returns None, so the result is not assigned
test_etf.drop(columns=drop_columns, inplace=True)

In [316]:
test_etf.head()

Unnamed: 0,Date,Open,Close
0,2010-07-21,24.333,23.946
1,2010-07-22,24.644,24.487
2,2010-07-23,24.759,24.507
3,2010-07-26,24.624,24.595
4,2010-07-27,24.477,24.517
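
An alternative to dropping columns after loading is to read only the needed columns in the first place. The sketch below assumes the same column layout as the ETF files and uses an in-memory sample in place of a real file; the `Date`, `Open`, and `Close` values come from the output above, while the `High`, `Low`, and `Volume` values are placeholders.

```python
import io
import pandas as pd

# In-memory stand-in for an ETF file; High, Low and Volume are placeholder zeros
sample = io.StringIO(
    "Date,Open,High,Low,Close,Volume,OpenInt\n"
    "2010-07-21,24.333,0,0,23.946,0,0\n"
    "2010-07-22,24.644,0,0,24.487,0,0\n"
)

# usecols keeps only the listed columns; parse_dates converts Date to datetime64
slim_etf = pd.read_csv(sample, usecols=["Date", "Open", "Close"], parse_dates=["Date"])
print(slim_etf.dtypes)
```

Reading fewer columns may also reduce memory use when thousands of files are processed in a loop.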


In [317]:
# Add a column with the daily change, computed as "Open" minus "Close"
# (a positive value means the price fell during the day)

test_etf['change']= test_etf['Open'] - test_etf['Close']

In [318]:
test_etf.head()

Unnamed: 0,Date,Open,Close,change
0,2010-07-21,24.333,23.946,0.387
1,2010-07-22,24.644,24.487,0.157
2,2010-07-23,24.759,24.507,0.252
3,2010-07-26,24.624,24.595,0.029
4,2010-07-27,24.477,24.517,-0.04
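
The introduction notes that the same dollar move can matter more for a cheaper security. One possible way to account for that, which this notebook does not use, is to express the daily move as a percentage of the open. The sketch below uses the first two rows of the output above; the column name `pct_change_day` is hypothetical.

```python
import pandas as pd

# First two rows of the aadr.us output above
demo = pd.DataFrame({"Open": [24.333, 24.644], "Close": [23.946, 24.487]})

# Same sign convention as the notebook: Open minus Close
demo["change"] = demo["Open"] - demo["Close"]

# Assumed alternative: Close minus Open as a percentage of Open,
# so a negative value means the price fell during the day
demo["pct_change_day"] = (demo["Close"] - demo["Open"]) / demo["Open"] * 100
print(demo.round(3))
```

Note that the two columns have opposite signs by construction, so any thresholds would need to follow whichever convention is chosen.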


#### Add the name of the ETF

In [319]:
test_etf['ETF_name'] = etf_name

#### Adjust the Date column for better processing

In [320]:
# convert string to datetime format
test_etf['Date'] = pd.to_datetime(test_etf.Date) 

In [321]:
# separate out the year
#test_etf['Year'], test_etf['Month'], test_etf['Day'] = test_etf.Date.dt.year, test_etf.Date.dt.month, test_etf.Date.dt.day
test_etf['Year'] = test_etf.Date.dt.year

In [322]:
#drop original date column
#test_etf.drop(columns = 'Date', axis =1, inplace = True)

In [323]:
test_etf.head()

Unnamed: 0,Date,Open,Close,change,ETF_name,Year
0,2010-07-21,24.333,23.946,0.387,aadr.us,2010
1,2010-07-22,24.644,24.487,0.157,aadr.us,2010
2,2010-07-23,24.759,24.507,0.252,aadr.us,2010
3,2010-07-26,24.624,24.595,0.029,aadr.us,2010
4,2010-07-27,24.477,24.517,-0.04,aadr.us,2010


#### Create Open and Close values for each Year

In [347]:
# create dataframe grouped by year
year_df = test_etf.groupby('Year')

In [348]:
# Find the first and last day of the year for each year
earliest_dates = year_df.Date.min()
last_dates = year_df.Date.max()

In [356]:
test_etf.query("Date in @earliest_dates or Date in @last_dates")

Unnamed: 0,Date,Open,Close,change,ETF_name,Year
0,2010-07-21,24.333,23.946,0.387,aadr.us,2010
112,2010-12-31,28.823,28.928,-0.105,aadr.us,2010
113,2011-01-03,29.065,29.141,-0.076,aadr.us,2011
309,2011-12-30,27.155,27.252,-0.097,aadr.us,2011
310,2012-01-03,28.087,27.766,0.321,aadr.us,2012
459,2012-12-31,30.107,29.853,0.254,aadr.us,2012
460,2013-01-02,30.741,30.593,0.148,aadr.us,2013
662,2013-12-31,37.059,36.532,0.527,aadr.us,2013
663,2014-01-02,36.921,36.356,0.565,aadr.us,2014
883,2014-12-31,36.687,36.687,0.0,aadr.us,2014


In [342]:
# create the dataframe that will hold the needed values
output_df = pd.DataFrame(columns = [ 'Year','ETF_name','Year_Open', 'Year_Close', 'Year_Change'])


In [None]:
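# Loop through the years: the earliest date supplies the opening price, the latest date the closing price
# (Year_Change is assumed here to be Year_Close minus Year_Open)
rows = []
for year, first_day in earliest_dates.items():
    year_open = test_etf.loc[test_etf.Date == first_day, 'Open'].iloc[0]
    year_close = test_etf.loc[test_etf.Date == last_dates.loc[year], 'Close'].iloc[0]
    rows.append([year, etf_name, year_open, year_close, year_close - year_open])
output_df = pd.DataFrame(rows, columns=output_df.columns)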

In [343]:
output_df

Unnamed: 0,Year,ETF_name,Year_Open,Year_Close,Year_Change
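
The yearly open and close values can also be derived with a grouped aggregation. The sketch below is one possible formulation, not the notebook's final method; it uses the 2010 and 2011 rows from the query output above, and it assumes `Year_Change` means the yearly close minus the yearly open.

```python
import pandas as pd

# First and last trading rows of 2010 and 2011 from the query output above
demo = pd.DataFrame({
    "Date": pd.to_datetime(["2010-07-21", "2010-12-31", "2011-01-03", "2011-12-30"]),
    "Open": [24.333, 28.823, 29.065, 27.155],
    "Close": [23.946, 28.928, 29.141, 27.252],
})
demo["Year"] = demo.Date.dt.year

# Sort by date so 'first' and 'last' mean the earliest and latest trading day
yearly = (
    demo.sort_values("Date")
        .groupby("Year")
        .agg(Year_Open=("Open", "first"), Year_Close=("Close", "last"))
        .reset_index()
)

# Assumption: yearly change is close minus open
yearly["Year_Change"] = yearly["Year_Close"] - yearly["Year_Open"]
yearly.insert(1, "ETF_name", "aadr.us")
print(yearly)
```

The resulting columns match the layout of `output_df`, so a frame like this could be used in its place.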


#### Set upper and lower boundaries for significant change
 This sets any value above the 0.8 quantile or below the 0.2 quantile as notable.

In [232]:
# sets the 80% and 20% boundaries
etf_high_lim = test_etf.change.quantile(.8)
etf_low_lim = test_etf.change.quantile(.2)
etf_high_lim, etf_low_lim

(0.17419999999999908, -0.12020000000000336)
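
To make the thresholds concrete, the sketch below applies the same `quantile` calls to a small made-up series and selects the values that fall outside the band. The numbers are illustrative only and are not ETF data.

```python
import pandas as pd

# Made-up daily changes, for illustration only
changes = pd.Series([-0.4, -0.2, -0.1, 0.0, 0.1, 0.2, 0.3, 0.5, 0.6, 0.9])

# quantile() interpolates linearly between neighbouring values by default
high_lim = changes.quantile(0.8)
low_lim = changes.quantile(0.2)

# Values strictly above the upper limit or strictly below the lower limit
notable = changes[(changes > high_lim) | (changes < low_lim)]
print(high_lim, low_lim, len(notable))
```

By construction, roughly 40% of the daily values fall outside a 20%/80% band, so these limits flag unusual days relative to this ETF's own history.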

#### Reduce pricing to years instead of days.

In [139]:
# replace the full date with just the year (vectorized; works for string or datetime dates)
test_etf['Date'] = test_etf['Date'].astype(str).str[:4]

In [140]:
# group data by year 
etf_by_year = test_etf.groupby('Date')

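# calculate the mean of the price columns for each year and store it in a new DataFrame
# (selecting the columns explicitly avoids averaging non-numeric columns such as ETF_name)
etf_means = etf_by_year[['Open', 'Close', 'change']].mean()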
etf_means

Date,Open,Close,change
2010,26.865071,26.871699,-0.006628
2011,28.726056,28.73802,-0.011964
2012,28.667567,28.668487,-0.00092
2013,33.480458,33.473271,0.007187
2014,36.895077,36.847964,0.047113
2015,38.015524,37.983107,0.032416
2016,38.623239,38.565904,0.057335
2017,48.974968,48.915607,0.059361


In [152]:
# compare the mean change of each year to the 20% and 80% limits set previously and
# create columns documenting whether it fell outside the low/high limit
etf_means['bad_year'] = 'no'
etf_means['good_year'] = 'no'
for year, mean_change in etf_means.change.items():
    print(mean_change)
    if mean_change < etf_low_lim:
        # .loc avoids chained assignment, which may fail to update etf_means
        etf_means.loc[year, 'bad_year'] = 'yes'
        print('changed bad_year')
    elif mean_change > etf_high_lim:
        etf_means.loc[year, 'good_year'] = 'yes'
        print('changed good_year')

-0.006628318584070786
-0.011964467005075976
-0.0009199999999997743
0.007187192118226144
0.04711312217194579
0.0324163090128755
0.05733478260869595
0.05936055045871555


In [153]:
etf_means

Date,Open,Close,change,bad_year,good_year
2010,26.865071,26.871699,-0.006628,no,no
2011,28.726056,28.73802,-0.011964,no,no
2012,28.667567,28.668487,-0.00092,no,no
2013,33.480458,33.473271,0.007187,no,no
2014,36.895077,36.847964,0.047113,no,no
2015,38.015524,37.983107,0.032416,no,no
2016,38.623239,38.565904,0.057335,no,no
2017,48.974968,48.915607,0.059361,no,no
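
The flags can also be set without a loop by using boolean masks. The sketch below is an alternative formulation, not the notebook's code; it uses three of the yearly means from the table above and rounded versions of the limits computed earlier. Every yearly mean lies inside the band, likely because the limits come from the daily distribution while a yearly average of daily changes sits much closer to zero.

```python
import pandas as pd

# Yearly means of 'change' taken from the table above
etf_means_demo = pd.DataFrame(
    {"change": [-0.006628, -0.011964, 0.059361]},
    index=pd.Index(["2010", "2011", "2017"], name="Date"),
)
etf_high_lim, etf_low_lim = 0.1742, -0.1202  # rounded limits from the quantile cell

# Start with 'no' everywhere, then overwrite rows where the mask is True
etf_means_demo["bad_year"] = "no"
etf_means_demo["good_year"] = "no"
etf_means_demo.loc[etf_means_demo["change"] < etf_low_lim, "bad_year"] = "yes"
etf_means_demo.loc[etf_means_demo["change"] > etf_high_lim, "good_year"] = "yes"
print(etf_means_demo)
```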


In [144]:
# creates the final dataframe, labelled with the ETF name, for merging into the dataframe housing all the ETF data
output_df = etf_means.drop(columns=['Open', 'Close', 'change'])
output_df['etf'] = etf_name


In [145]:
output_df

Date,bad_year,good_year,etf
2010,no,no,aadr.us
2011,no,no,aadr.us
2012,no,no,aadr.us
2013,no,no,aadr.us
2014,no,no,aadr.us
2015,no,no,aadr.us
2016,no,no,aadr.us
2017,no,no,aadr.us
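
Since this per-ETF table is meant to be merged into one dataframe covering all ETFs, a minimal sketch of that step follows. It assumes each ETF yields a frame shaped like `output_df`; `make_output` is a hypothetical helper, the second ETF name and its flag values are made up, and `pd.concat` is only one possible way to combine the frames.

```python
import pandas as pd

def make_output(etf, years, bad, good):
    """Build a per-ETF frame shaped like output_df (hypothetical helper)."""
    return pd.DataFrame(
        {"bad_year": bad, "good_year": good, "etf": etf},
        index=pd.Index(years, name="Date"),
    )

frames = [
    make_output("aadr.us", ["2010", "2011"], ["no", "no"], ["no", "no"]),
    # made-up ETF and flags, for illustration only
    make_output("example.us", ["2010", "2011"], ["no", "yes"], ["no", "no"]),
]

# Stack the per-ETF frames; reset_index turns the year index into a column
all_etfs = pd.concat(frames).reset_index()
print(all_etfs)
```

Collecting the frames in a list and concatenating once is usually cheaper than growing a dataframe inside the loop.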
