In [72]:
import pandas as pd

## Market data processing

There are thousands of files, each covering a different span of time and a different range of values. The ultimate goal is to determine whether the market is moving in a statistically significant positive or negative direction. To make that determination, it must first be known whether each relevant entity is moving in a significant direction. Significance is relative: a stock like Apple might move hundreds of dollars in a month without that being considered significant, while the same movement for a start-up company might be quite significant. To minimize the impact of individual stocks or companies making a significant price movement, only the ETF files will be processed, as they group similar categories of companies together and reduce the observed market volatility.
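The point about relative movement can be illustrated with a small sketch. The entities and prices below are invented for illustration and do not come from the data set, and the column names are assumptions as well.

```python
import pandas as pd

# Hypothetical prices, made up purely to illustrate the argument above.
prices = pd.DataFrame(
    {
        "entity": ["large_cap", "small_cap"],
        "start_price": [1000.0, 50.0],
        "end_price": [1100.0, 150.0],
    }
)

# Both entities move by the same dollar amount...
prices["abs_change"] = prices["end_price"] - prices["start_price"]

# ...but the move relative to the starting price is very different.
prices["pct_change"] = prices["abs_change"] * 100 / prices["start_price"]

print(prices)
```

The same absolute change is a modest move for the expensive entity and a very large one for the cheap entity, which is why a dollar amount alone cannot decide significance.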

### Preprocessing testing

This file develops the import and processing of a single ETF file. The attributes that need to be retained are the ETF name, the year, and whether the price made a significant positive or negative movement. This process will be moved into a loop in another script to process all the ETFs.
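One possible shape for the per-file step that the later loop could call is sketched here. The helper names `etf_name_from_path` and `load_etf` are hypothetical. The assumption that the ETF name is the part of the file name before the first dot is inferred from `aadr.us.txt`. The in-memory sample reuses two rows from the `head()` output shown later in this notebook instead of reading the real file.

```python
import io
from pathlib import Path

import pandas as pd


def etf_name_from_path(path):
    """Return the part of the file name before the first dot (assumed to be the ETF name)."""
    return Path(path).name.split(".")[0]


def load_etf(source, name):
    """Read one ETF file, keep Date, Open and Close, and tag the rows with the ETF name."""
    frame = pd.read_csv(source, usecols=["Date", "Open", "Close"])
    frame["etf"] = name
    return frame


# In-memory stand-in for a file; the two rows come from this notebook's output.
sample = io.StringIO(
    "Date,Open,Close\n"
    "2010-07-21,24.333,23.946\n"
    "2010-07-22,24.644,24.487\n"
)

name = etf_name_from_path("Data/markets_test/aadr.us.txt")
frame = load_etf(sample, name)
print(frame)
```

Keeping the name extraction separate from the loading step would let the loop pass either a real path or a test buffer to `load_etf`.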

#### Reduce pricing to years instead of days.

In [73]:
test_etf = pd.read_csv("Data/markets_test/aadr.us.txt")

In [74]:
"Min", test_etf.min(), "Max", test_etf.max()

('Min',
 Date       2010-07-21
 Open           23.936
 High           23.946
 Low            23.867
 Close          23.946
 Volume              2
 OpenInt             0
 dtype: object,
 'Max',
 Date       2017-11-10
 Open            58.62
 High            58.72
 Low              57.7
 Close           58.43
 Volume         106139
 OpenInt             0
 dtype: object)

In [75]:
test_etf.Date.info()

<class 'pandas.core.series.Series'>
RangeIndex: 1565 entries, 0 to 1564
Series name: Date
Non-Null Count  Dtype 
--------------  ----- 
1565 non-null   object
dtypes: object(1)
memory usage: 12.4+ KB


In [76]:
test_etf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1565 entries, 0 to 1564
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Date     1565 non-null   object 
 1   Open     1565 non-null   float64
 2   High     1565 non-null   float64
 3   Low      1565 non-null   float64
 4   Close    1565 non-null   float64
 5   Volume   1565 non-null   int64  
 6   OpenInt  1565 non-null   int64  
dtypes: float64(4), int64(2), object(1)
memory usage: 85.7+ KB


In [77]:
# Drop unneeded columns: OpenInt, Volume, Low, High
drop_columns = ["OpenInt", "Volume", "Low", "High"]
# inplace=True modifies test_etf directly and returns None, so the result is not assigned
test_etf.drop(columns=drop_columns, inplace=True)

In [78]:
test_etf.head()

Unnamed: 0,Date,Open,Close
0,2010-07-21,24.333,23.946
1,2010-07-22,24.644,24.487
2,2010-07-23,24.759,24.507
3,2010-07-26,24.624,24.595
4,2010-07-27,24.477,24.517


In [79]:
# Add a column with the daily change, computed as Open minus Close
# (a positive value means the price fell during the day)
test_etf['change']= test_etf['Open'] - test_etf['Close']

In [80]:
test_etf.head()

Unnamed: 0,Date,Open,Close,change
0,2010-07-21,24.333,23.946,0.387
1,2010-07-22,24.644,24.487,0.157
2,2010-07-23,24.759,24.507,0.252
3,2010-07-26,24.624,24.595,0.029
4,2010-07-27,24.477,24.517,-0.04


In [82]:
# Replace the full date string (YYYY-MM-DD) with just the year.
# A vectorized string slice avoids the row-by-row chained assignment
# (test_etf.Date.loc[i] = ...), which raised pandas chained-assignment and
# SettingWithCopy warnings and will not update the DataFrame under
# Copy-on-Write (the default behaviour in pandas 3.0).
test_etf["Date"] = test_etf["Date"].str[:4]

In [84]:
test_etf.tail()

Unnamed: 0,Date,Open,Close,change
1560,2017,57.61,57.65,-0.04
1561,2017,57.29,57.285,0.005
1562,2017,57.31,57.49,-0.18
1563,2017,57.23,56.9265,0.3035
1564,2017,56.96,56.4,0.56


Index(['2015', '2016', '2014', '2017', '2013', '2011', '2012', '2010'], dtype='object', name='Date')
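
With the dates reduced to years, the daily changes can be collapsed to one value per year. The following is a minimal sketch of that step under stated assumptions: the prices are invented for illustration, the `Year` column name is hypothetical, and summing the daily `change` is only one possible yearly summary, not necessarily the measure the final script will use. It parses the dates with `pd.to_datetime` as an alternative to slicing the string.

```python
import pandas as pd

# Hypothetical daily prices, made up for illustration only.
daily = pd.DataFrame(
    {
        "Date": ["2010-07-21", "2010-07-22", "2011-01-03", "2011-01-04"],
        "Open": [10.0, 11.0, 20.0, 21.0],
        "Close": [11.0, 10.5, 22.0, 20.0],
    }
)

# Same convention as the notebook: change is Open minus Close.
daily["change"] = daily["Open"] - daily["Close"]

# Parse the date strings and keep only the calendar year as an integer.
daily["Year"] = pd.to_datetime(daily["Date"]).dt.year

# One row per year: the sum of the daily changes within that year.
yearly = daily.groupby("Year")["change"].sum()

print(yearly)
```

Because `groupby` sorts its keys by default, the yearly result comes back in chronological order.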