# YOUR PROJECT TITLE

> **Note the following:** 
> 1. This is *not* meant to be an example of an actual **data analysis project**, just an example of how to structure such a project.
> 1. Remember the general advice on structuring and commenting your code.
> 1. The `dataproject.py` file includes a function which can be used multiple times in this notebook.

Imports and set magics:

In [93]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import ipywidgets as widgets
import seaborn as sns
from datetime import datetime

# Autoreload modules when code is run
%load_ext autoreload
%autoreload 2

# Install yfinance if it is missing (run once in a notebook cell):
# %pip install yfinance
import yfinance as yf

# user written modules
import dataproject as dp

plt.rcParams.update({"axes.grid":True,"grid.color":"black","grid.alpha":"0.25","grid.linestyle":"--"})
plt.rcParams.update({'font.size': 14})


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# Reading and cleaning data

I import a CSV file that contains S&P 500 companies and their industries.

In [94]:
# Read file, sort values in alphabetical order and reset index
SP500 = (pd.read_csv('sp500-companies 2.csv', encoding='ISO-8859-1')
         .sort_values(by=['Ticker'],ascending=True)
         .reset_index(drop=True))

# Drop columns we don't need
drop_columns = ['Sub-Industry', 'Headquarters Location', 'Date added', 'Founded']
SP500.drop(drop_columns, axis=1, inplace=True)

# Remove duplicate tickers, keeping the first occurrence
SP500 = SP500.drop_duplicates(subset='Ticker', keep='first').reset_index(drop=True)

# Display dataframe
SP500.head()

,Ticker,Name,Industry
0,A,Agilent Technologies,Health Care
1,AAL,American Airlines Group,Industrials
2,AAP,Advance Auto Parts,Consumer Discretionary
3,AAPL,Apple Inc.,Information Technology
4,ABBV,AbbVie,Health Care


I create a list of the tickers to pass to yfinance.

In [95]:
# Create a list of yfinance tickers 
SP500_tickers = list(SP500['Ticker'])

In [96]:
# Download historical market data
hist_prices = yf.download(tickers = SP500_tickers, start = '2023-01-01',
                        end = '2023-12-31',
                        interval = '1mo')

# Get adjusted close for each stock
hist_prices = hist_prices['Adj Close']

# Make sure the index is a DatetimeIndex
hist_prices.index = pd.to_datetime(hist_prices.index)

# Display DataFrame
hist_prices.head()

[*********************100%%**********************]  503 of 503 completed

9 Failed downloads:
['ATVI', 'BRK.B', 'DISH', 'ABC', 'FRC', 'RE', 'PKI', 'CDAY']: Exception('%ticker%: No timezone found, symbol may be delisted')
['BF.B']: Exception('%ticker%: No price data found, symbol may be delisted (1mo 2023-01-01 -> 2023-12-31)')


Date,A,AAL,AAP,AAPL,ABBV,ABC,ABT,ACGL,ACN,ADBE,...,WYNN,XEL,XOM,XRAY,XYL,YUM,ZBH,ZBRA,ZION,ZTS
2023-01-01,150.738098,16.139999,148.047424,143.305115,139.367401,,107.328468,64.349998,272.630829,370.339996,...,102.589592,65.787468,111.14698,36.091808,102.376045,127.461823,126.122299,316.179993,50.097919,163.622955
2023-02-01,140.7173,15.98,140.930878,146.403824,146.548462,,99.202927,70.0,260.502533,323.950012,...,107.271652,61.769623,105.302689,37.30695,101.037407,124.190079,122.685493,300.25,47.704227,165.503891
2023-03-01,137.119324,14.75,118.229889,164.024475,151.757172,,98.754303,67.870003,280.377533,385.369995,...,110.775772,64.515152,105.871613,38.492699,103.373154,129.588943,127.964516,318.0,28.427031,164.948914
2023-04-01,134.235001,13.64,122.040924,168.779099,143.90126,,107.736404,75.07,274.962463,377.559998,...,113.121758,67.414871,114.251747,41.242657,102.524048,137.928619,137.377823,288.029999,26.460979,174.205231
2023-05-01,114.836136,14.78,71.751022,176.308914,132.57843,,99.973137,69.699997,301.283752,417.790009,...,97.699654,62.959755,98.650024,35.527893,98.930176,126.262878,126.362991,262.570007,25.919601,161.896591
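
Two of the failed downloads reported above, `BRK.B` and `BF.B`, look like a symbol-format mismatch rather than delistings: Yahoo Finance writes share classes with a hyphen (`BRK-B`) instead of a dot. A minimal sketch of normalizing the tickers before the download, assuming the hyphen convention applies to every dotted ticker in the list (the variable names are illustrative, not from the notebook):

```python
# Tickers as they appear in the S&P 500 list (two of them taken from the failure message)
sp500_tickers = ["AAPL", "BRK.B", "BF.B"]

# Assumption: Yahoo Finance expects a hyphen where the index list uses a dot
yahoo_tickers = [ticker.replace(".", "-") for ticker in sp500_tickers]
print(yahoo_tickers)
```

The remaining failures may be genuine delistings or renames and would need to be checked one by one.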


In [97]:
# Remove columns with NaN values
hist_prices_clean = hist_prices.dropna(axis=1)
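
Dropping every column that contains a NaN silently removes tickers from the analysis, so it can help to record which ones were lost. The following is a sketch on a toy price table (hypothetical tickers and values); in the notebook the same steps would apply to `hist_prices`:

```python
import numpy as np
import pandas as pd

# Toy price table: "BBB" misses one month, "CCC" has no data at all
hist_prices = pd.DataFrame(
    {"AAA": [10.0, 11.0, 12.0], "BBB": [5.0, np.nan, 6.0], "CCC": [np.nan, np.nan, np.nan]}
)

hist_prices_clean = hist_prices.dropna(axis=1)

# Tickers removed by dropna, with the number of missing months for each
dropped = hist_prices.columns.difference(hist_prices_clean.columns)
missing_counts = hist_prices[dropped].isna().sum()
print(missing_counts)
```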


In [98]:
# Calculate monthly and cumulative returns 
monthly_returns, cumulative_returns = dp.calculate_returns(hist_prices_clean)

# Set the first row of the cumulative returns to 1
cumulative_returns.iloc[0] = 1

# Display DataFrame
cumulative_returns.head()


Date,A,AAL,AAP,AAPL,ABBV,ABT,ACGL,ACN,ADBE,ADI,...,WYNN,XEL,XOM,XRAY,XYL,YUM,ZBH,ZBRA,ZION,ZTS
2023-01-01,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
2023-02-01,0.933522,0.990087,0.951931,1.021623,1.051526,0.924293,1.087801,0.955514,0.874737,1.069983,...,1.045639,0.938927,0.947418,1.033668,0.986924,0.974332,0.97275,0.949617,0.95222,1.011496
2023-03-01,0.909653,0.913879,0.798595,1.144582,1.0889,0.920113,1.054701,1.028415,1.040584,1.155525,...,1.079795,0.98066,0.952537,1.066522,1.00974,1.016688,1.014607,1.005756,0.567429,1.008104
2023-04-01,0.890518,0.845105,0.824337,1.17776,1.032532,1.003801,1.166589,1.008552,1.019496,1.053929,...,1.102663,1.024737,1.027934,1.142715,1.001446,1.082117,1.089243,0.910968,0.528185,1.064675
2023-05-01,0.761826,0.915737,0.484649,1.230304,0.951287,0.931469,1.083139,1.105098,1.128126,1.041097,...,0.952335,0.957017,0.887564,0.984376,0.966341,0.990594,1.001908,0.830445,0.517379,0.989449
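
The source of `dp.calculate_returns` is not shown in this notebook. The output is consistent with simple monthly returns compounded from the first month: for ticker `A`, the February value 0.933522 equals 140.7173 / 150.738098. A hypothetical stand-in, assuming that definition (the name `calculate_returns_sketch` is made up for illustration), could look like this:

```python
import pandas as pd

def calculate_returns_sketch(prices: pd.DataFrame):
    """Hypothetical stand-in for dp.calculate_returns; the real function is not shown."""
    # Simple month-over-month returns; the first row is NaN because there is no prior month
    monthly_returns = prices.pct_change()
    # Compound the monthly returns into a growth factor relative to the first month
    cumulative_returns = (1 + monthly_returns).cumprod()
    return monthly_returns, cumulative_returns

# Toy prices (hypothetical values): 10% growth in each of two months
prices = pd.DataFrame({"AAA": [100.0, 110.0, 121.0]})
monthly, cumulative = calculate_returns_sketch(prices)
cumulative.iloc[0] = 1  # same normalization of the first row as in the notebook cell
print(cumulative)
```

Under this definition the first row of the cumulative returns is NaN, which would explain why the notebook sets it to 1 by hand.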


In [99]:
# Group tickers by industry
grouped_companies = {}
for _, row in SP500.iterrows():
    grouped_companies.setdefault(row['Industry'], []).append(row['Ticker'])

print(grouped_companies)

{'Health Care': ['A', 'ABBV', 'ABC', 'ABT', 'ALGN', 'AMGN', 'BAX', 'BDX', 'BIIB', 'BIO', 'BMY', 'BSX', 'CAH', 'CI', 'CNC', 'COO', 'CRL', 'CTLT', 'CVS', 'DGX', 'DHR', 'DVA', 'DXCM', 'ELV', 'EW', 'GEHC', 'GILD', 'HCA', 'HOLX', 'HSIC', 'HUM', 'IDXX', 'ILMN', 'INCY', 'IQV', 'ISRG', 'JNJ', 'LH', 'LLY', 'MCK', 'MDT', 'MOH', 'MRK', 'MRNA', 'MTD', 'OGN', 'PFE', 'PKI', 'PODD', 'REGN', 'RMD', 'STE', 'SYK', 'TECH', 'TFX', 'TMO', 'UHS', 'UNH', 'VRTX', 'VTRS', 'WAT', 'WST', 'XRAY', 'ZBH', 'ZTS'], 'Industrials': ['AAL', 'ALK', 'ALLE', 'AME', 'AOS', 'BA', 'CARR', 'CAT', 'CHRW', 'CMI', 'CPRT', 'CSGP', 'CSX', 'CTAS', 'DAL', 'DE', 'DOV', 'EFX', 'EMR', 'ETN', 'EXPD', 'FAST', 'FDX', 'FTV', 'GD', 'GE', 'GNRC', 'GWW', 'HII', 'HON', 'HWM', 'IEX', 'IR', 'ITW', 'J', 'JBHT', 'JCI', 'LDOS', 'LHX', 'LMT', 'LUV', 'MAS', 'MMM', 'NDSN', 'NOC', 'NSC', 'ODFL', 'OTIS', 'PCAR', 'PH', 'PNR', 'PWR', 'RHI', 'ROK', 'ROL', 'RSG', 'RTX', 'SNA', 'SWK', 'TDG', 'TT', 'TXT', 'UAL', 'UNP', 'UPS', 'URI', 'VRSK', 'WAB', 'WM', 'XYL']

## Exploring each data set

To **explore the raw data**, you may provide **static** and **interactive plots** that show important developments.

**Interactive plot**:

In [100]:
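dropdown = widgets.Dropdown(options=list(grouped_companies), description='Industry:')

# Plot cumulative returns for the selected industry's tickers that survived the NaN cleaning
def plot_industry(industry):
    tickers = [t for t in grouped_companies[industry] if t in cumulative_returns.columns]
    cumulative_returns[tickers].plot(legend=False, title=industry)

widgets.interact(plot_industry, industry=dropdown)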

Explain what you see when moving elements of the interactive plot around. 

# Merge data sets

Now you create combinations of your loaded data sets. Remember the illustration of an (inner) **merge**:

Here we are dropping elements from both data set X and data set Y. A left join would keep all observations in data set X intact and subset only from Y.

Make sure that your resulting data sets have the correct number of rows and columns. That is, be clear about which observations are thrown away. 

**Note:** Don't make Venn diagrams in your own data project. It is just for exposition. 
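
The difference between an inner and a left merge, and the row-count check recommended above, can be sketched with pandas on two toy data sets (the tickers, values, and column names are hypothetical):

```python
import pandas as pd

# Toy data sets: "AAA" appears only in companies, "DDD" only in returns
companies = pd.DataFrame({"Ticker": ["AAA", "BBB", "CCC"],
                          "Industry": ["Tech", "Energy", "Tech"]})
returns = pd.DataFrame({"Ticker": ["BBB", "CCC", "DDD"],
                        "Return": [0.05, -0.02, 0.10]})

# Inner merge: keeps only tickers present in both data sets
inner = companies.merge(returns, on="Ticker", how="inner")

# Left merge: keeps every row of companies; unmatched rows get NaN in Return
left = companies.merge(returns, on="Ticker", how="left", indicator=True)

print(len(inner), len(left))
print(left["_merge"].value_counts())
```

The `indicator=True` argument adds a `_merge` column that marks each row as `both` or `left_only`, which makes it easy to report which observations had no match.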

# Analysis

To get a quick overview of the data, we show some **summary statistics** on a meaningful aggregation. 
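
One possible aggregation is the equal-weighted average cumulative return per industry. The sketch below uses toy stand-ins (hypothetical tickers and values) for the notebook's `cumulative_returns` and `grouped_companies`; equal weighting is an assumption, not a choice made in the notebook:

```python
import pandas as pd

# Toy stand-ins for the notebook's cumulative_returns and grouped_companies
cumulative_returns = pd.DataFrame(
    {"AAA": [1.0, 1.10, 1.20], "BBB": [1.0, 0.90, 1.00], "CCC": [1.0, 1.05, 0.80]},
    index=pd.to_datetime(["2023-01-01", "2023-02-01", "2023-03-01"]),
)
grouped_companies = {"Tech": ["AAA", "BBB"], "Energy": ["CCC", "DDD"]}  # DDD was dropped in cleaning

# Map each ticker to its industry, keeping only tickers that survived cleaning
ticker_to_industry = {
    ticker: industry
    for industry, tickers in grouped_companies.items()
    for ticker in tickers
    if ticker in cumulative_returns.columns
}

# Equal-weighted average cumulative return per industry and month
industry_returns = cumulative_returns.T.groupby(ticker_to_industry).mean().T

# Summary statistics across months for each industry
summary = industry_returns.describe().T[["mean", "std", "min", "max"]]
print(summary)
```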

MAKE FURTHER ANALYSIS. EXPLAIN THE CODE BRIEFLY AND SUMMARIZE THE RESULTS.

# Conclusion

ADD CONCISE CONCLUSION.