# Data Cleaning Appendix

## Data collection and cleaning

Our team could not find a ready-made dataset with pricing data for the S&P 500 companies over the last 10 years. We therefore built our own by combining a list of the current S&P 500 companies with price data pulled through the Yahoo Finance API (via the yahoo_fin package).

Source data:
- List of current S&P 500 companies: https://github.com/datasets/s-and-p-500-companies/blob/master/data/constituents.csv
- Yahoo Finance API: http://theautomatic.net/yahoo_fin-documentation/

These files can be downloaded from the Google Drive folder https://drive.google.com/drive/folders/1I_hP_G51eLKYcYOwQKrcTGOimkLhTL52?usp=sharing

In [None]:
# 1. Convert the S&P 500 constituents list into a text file of ticker symbols
import pandas as pd

spList = pd.read_csv('Source Data/constituents.csv')
spListTick = spList["Symbol"].tolist()
spListTick.append('SPY')  # add SPY (S&P 500 ETF) as a market benchmark
# write all tickers on one line, separated by spaces
with open("companies.txt", "w") as textfile:
    for element in spListTick:
        textfile.write(element + " ")

1. We first downloaded a CSV file of the current S&P 500 companies and their sectors from https://github.com/datasets/s-and-p-500-companies/blob/master/data/constituents.csv. We converted the list of tickers to a text file that is later used to call the Yahoo Finance API. We also added SPY, an ETF that tracks the S&P 500 index, to the list so that the dataset includes the overall market's performance.
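
The round trip through the text file can be sketched as follows. This is a minimal illustration that assumes the same layout as companies.txt (all tickers on one line, each followed by a space); the three-ticker list and the `companies_demo.txt` file name are hypothetical. It shows why the file has to be split on whitespace when it is read back.

```python
import os
import tempfile

tickers = ["MMM", "AOS", "SPY"]  # short illustrative list, not the full index

# Write the tickers on a single line, each followed by a space
path = os.path.join(tempfile.mkdtemp(), "companies_demo.txt")
with open(path, "w") as f:
    for t in tickers:
        f.write(t + " ")

# Reading line by line yields ONE string that contains every ticker
with open(path) as f:
    naive = [line.rstrip() for line in f.readlines()]

# Splitting on whitespace recovers the individual symbols
with open(path) as f:
    loaded = f.read().split()

print(naive)
print(loaded)
```

Writing one ticker per line would make a line-by-line read work as well; the sketch keeps the space-separated layout used in this notebook.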

In [None]:
# install the yahoo_fin package used to query Yahoo Finance
!pip3 install yahoo-fin

In [None]:
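# 2. Download monthly price data for every ticker through yahoo_fin
import yahoo_fin.stock_info as si
import pandas as pd

# companies.txt holds all tickers on one line, separated by spaces
with open("companies.txt") as file:
    companyList = file.read().split()

stockData = {}
for ticker in companyList:
    try:
        stockData[ticker] = si.get_data(ticker, start_date="01/01/2010", end_date="01/03/2020", interval="1mo")
    except Exception as err:
        # skip tickers for which no data could be retrieved
        print("Skipping {}: {}".format(ticker, err))

# DataFrame.append was removed in pandas 2.0, so stack the frames with pd.concat
sum_df = pd.concat(stockData.values())
sum_df.index = sum_df.index.rename("date")  # later cells read this column as 'date'
sum_df.to_excel("output.xlsx")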

In [None]:
output = pd.read_excel("output.xlsx",index_col=None)
output.tail()

In [None]:
# monthly return in percent, computed within each ticker (first month of a ticker is NaN)
output['monthly_return'] = output.groupby('ticker')['adjclose'].pct_change() * 100
output.tail()
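
To make the monthly return calculation concrete, here is a small sketch on made-up prices. The tickers `AAA` and `BBB` and their prices are hypothetical; only the column names `ticker`, `adjclose`, and `monthly_return` match our dataset. The point is that the percentage change is taken within each ticker, so the first month of a ticker has no return instead of one computed against the previous company's price.

```python
import pandas as pd

# Hypothetical toy prices for two made-up tickers (not real market data)
toy = pd.DataFrame({
    "ticker":   ["AAA", "AAA", "AAA", "BBB", "BBB"],
    "adjclose": [100.0, 110.0, 99.0, 50.0, 55.0],
})

# Percentage change from the previous row of the same ticker, in percent
toy["monthly_return"] = toy.groupby("ticker")["adjclose"].pct_change() * 100

print(toy)
```

The first row of `AAA` and the first row of `BBB` come out as NaN, which is why those rows never count as above average later on.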

Next, we add each company's name and sector to the dataset as new columns by merging the constituents.csv file with our price data.

In [None]:
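# rename the constituents columns to match our dataset and add a row for SPY
spList = spList.rename(columns={'Symbol': 'ticker', 'Sector': 'sector', "Name": 'name'})
df2 = pd.DataFrame({'name': ["S&P 500"],'ticker': ['SPY'], 'sector': ['All']})
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
spList = pd.concat([spList, df2], ignore_index=True, sort=False)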

In [None]:
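# attach name and sector to every price row (inner join on ticker)
finalOutput = pd.merge(output, spList, on="ticker")

# the row counts should match if no price rows were dropped or duplicated
check = (output.shape[0] == finalOutput.shape[0])
print("Was the join performed correctly? {}".format(check))
print(finalOutput.shape[0])
print(output.shape[0])

finalOutput.to_csv('finalOutput.csv', index=False)
finalOutput.tail()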
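
Comparing row counts tells us whether rows were lost, but not which tickers caused it. One possible way to dig further, sketched here on hypothetical toy frames (`prices` and `sectors` stand in for our price data and the constituents list), is a left join with pandas' `indicator` and `validate` options. This is an illustration under those assumptions, not the check used in this notebook.

```python
import pandas as pd

# Hypothetical stand-ins: "ZZZ" has prices but no sector entry
prices = pd.DataFrame({"ticker": ["AAA", "AAA", "ZZZ"], "adjclose": [1.0, 2.0, 3.0]})
sectors = pd.DataFrame({"ticker": ["AAA", "BBB"], "sector": ["Tech", "Energy"]})

# how="left" keeps every price row; validate raises if a ticker appears
# more than once in the sector table; indicator adds a "_merge" column
merged = pd.merge(prices, sectors, on="ticker", how="left",
                  validate="many_to_one", indicator=True)

# Tickers that found no match in the sector table
unmatched = merged.loc[merged["_merge"] == "left_only", "ticker"].unique().tolist()

print(merged)
print("Tickers without a sector:", unmatched)
```

With a left join the row count always equals that of the price data, so the list of unmatched tickers becomes the signal to inspect.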

Now, we create two new columns: one to track whether a month's return is above that company's average monthly return, and another to track whether the previous month's return was above that average.

In [None]:
outputNew = pd.read_csv('finalOutput.csv')
outputNew['month'] = pd.DatetimeIndex(outputNew['date']).month
# average monthly return of each ticker, repeated on every row of that ticker
outputNew['averages'] = outputNew.groupby('ticker')['monthly_return'].transform('mean')

def isAboveAverage(averages,monthlyret):
    if monthlyret > averages:
        return 1
    return 0


outputNew["above_avg"] = outputNew.apply(lambda x: isAboveAverage(x["averages"],x['monthly_return']), axis=1)

outputNew.tail()

def previous_above_avg(dataframe):
    # take the flag from the previous row of the same ticker;
    # the first month of each ticker has no predecessor and stays empty
    dataframe["prev_above_avg"] = dataframe.groupby("ticker")["above_avg"].shift(1)
    return dataframe

outputNew = previous_above_avg(outputNew)
outputNew.to_csv('finalOutput.csv', index=False)
outputNew.head()
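
The two flags can be checked by hand on a small example. The sketch below uses hypothetical returns for two made-up tickers and the same column names as our dataset. It computes the per-ticker average with `transform`, the above-average flag with a vectorized comparison, and the previous-month flag with a shift within each ticker. It illustrates the logic and is not a replacement for the cell above.

```python
import pandas as pd

# Hypothetical monthly returns (in percent) for two made-up tickers
toy = pd.DataFrame({
    "ticker":         ["AAA", "AAA", "AAA", "BBB", "BBB", "BBB"],
    "monthly_return": [1.0, 5.0, 3.0, -2.0, 0.0, 4.0],
})

# Each row receives the mean return of its own ticker
toy["averages"] = toy.groupby("ticker")["monthly_return"].transform("mean")

# 1 if this month's return is strictly above the ticker's average, else 0
toy["above_avg"] = (toy["monthly_return"] > toy["averages"]).astype(int)

# Flag of the previous month for the same ticker; first month is NaN
toy["prev_above_avg"] = toy.groupby("ticker")["above_avg"].shift(1)

print(toy)
```

For `AAA` the average is 3.0, so only the 5.0 month is flagged, and that flag appears as `prev_above_avg` on the following row.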