# S&P Constituents

## Introduction

- insert motivation(s) - the what?
- insert methodology - the why?

Import packages

In [1]:
import numpy as np

In [2]:
import pandas as pd

In [3]:
import requests as re

In [4]:
from bs4 import BeautifulSoup

In [5]:
import pickle

## Wikipedia

Set the pages we want to scrape S&P indices data from

In [6]:
page = ['https://en.wikipedia.org/wiki/List_of_S%26P_500_companies',
        'https://en.wikipedia.org/wiki/List_of_S%26P_400_companies',
        'https://en.wikipedia.org/wiki/List_of_S%26P_1000_companies']

Define the files we want to store extracted data to

In [7]:
file = ['sp500_wikipedia.pickle',
        'sp400_wikipedia.pickle',
        'sp1000_wikipedia.pickle']

The code below scrapes data using [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/), and saves the extracted data using [pickle](https://docs.python.org/3/library/pickle.html)

In [8]:
for i in range(len(page)):

    # Fetch the page (`re` is the alias given to requests above)
    r = re.get(page[i])

    # Create a soup object, naming the parser explicitly to avoid a bs4 warning
    soup = BeautifulSoup(r.text, 'html.parser')

    # Find the S&P constituents table (attrs must be a dict, not a set)
    table = soup.find('table', attrs={'class': 'wikitable sortable'})

    # Get the external links in the table
    tickers = table.find_all('a', attrs={'class': 'external text'})
    # find_all returns tickers and SEC filings alternately, keep the tickers only
    tickers = tickers[::2]

    # Create a list containing the tickers
    sp = [ticker.text for ticker in tickers]

    # Save the list to a file; the with block closes it automatically
    with open(file[i], 'wb') as output:
        pickle.dump(sp, output)

Define the S&P constituents lists from Wikipedia

In [9]:
sp500_wikipedia = []
sp400_wikipedia = []
sp1000_wikipedia = []
sp_wikipedia = [sp500_wikipedia,
                sp400_wikipedia,
                sp1000_wikipedia]

Load the pickled data to the `sp_wikipedia`

In [10]:
for i in range(len(sp_wikipedia)):
    # The with block closes the file automatically, so no explicit close is needed
    with open(file[i], 'rb') as f:
        sp_wikipedia[i] = pickle.load(f)
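
The save and load steps above can also be wrapped in two small helpers. The sketch below is only an illustration: the helper names `save_tickers` and `load_tickers`, the toy tickers, and the temporary file are assumptions, not part of the original notebook.

```python
import pickle
import tempfile
from pathlib import Path


def save_tickers(tickers, path):
    """Pickle a list of tickers to the given path."""
    with open(path, 'wb') as f:
        pickle.dump(list(tickers), f)


def load_tickers(path):
    """Load a pickled list of tickers from the given path."""
    with open(path, 'rb') as f:
        return pickle.load(f)


# Round trip on made-up tickers, written to a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    toy_path = Path(tmp) / 'toy_index.pickle'
    save_tickers(['AAA', 'BBB', 'CCC'], toy_path)
    restored = load_tickers(toy_path)

print(restored)
```

Returning the loaded list directly avoids having to pre-define empty lists before filling them.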

Check the number of constituents; it should equal 505 according to [S&P Dow Jones Indices](https://us.spindices.com/indices/equity/sp-500)

In [11]:
sp500_wikipedia = sp_wikipedia[0]

In [12]:
len(sp500_wikipedia)

505

Check the number of constituents; it should equal 400 according to [S&P Dow Jones Indices](https://us.spindices.com/indices/equity/sp-400)

In [13]:
sp400_wikipedia = sp_wikipedia[1]

In [14]:
len(sp400_wikipedia)

400

Check the number of constituents; it should equal 1001 according to [S&P Dow Jones Indices](https://us.spindices.com/indices/equity/sp-1000)

In [15]:
sp1000_wikipedia = sp_wikipedia[2]

In [16]:
len(sp1000_wikipedia)

1001

Create a list of S&P 600 constituents, given that the S&P 1000 index combines the S&P 400 and S&P 600 indices

In [17]:
sp600_wikipedia = list(set(sp1000_wikipedia) - set(sp400_wikipedia))

In [18]:
len(sp600_wikipedia)

598

The Wikipedia-derived list contains only 598 tickers, while [S&P Dow Jones Indices](https://www.spindices.com/indices/equity/sp-600) indicates that there should be 601. Possible explanations for the 3 missing tickers:
- the S&P 600 list is not scraped directly but deduced from the S&P 400 and S&P 1000 lists, and
- there may be a timing difference between when the constituents of the S&P 400 and S&P 1000 pages were updated.
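
One way to narrow down where the 3 tickers went is to count distinct and repeated entries before taking the set difference. A set difference can never hold fewer than `len(set(a)) - len(set(b))` elements, so a result of 598 from lists of 1001 and 400 entries implies that the scraped S&P 1000 list holds at most 998 distinct values, that is, some entries are repeated. The sketch below runs on made-up tickers; the function name `diagnose_difference` and the toy lists are assumptions for illustration.

```python
from collections import Counter


def diagnose_difference(superset_list, subset_list):
    """Summarise how a list-based set difference was formed."""
    unique_super = set(superset_list)
    unique_sub = set(subset_list)
    # Entries that appear more than once in the larger list
    duplicates = sorted(t for t, n in Counter(superset_list).items() if n > 1)
    return {
        'n_super': len(superset_list),
        'n_unique_super': len(unique_super),
        'duplicates': duplicates,
        'n_overlap': len(unique_super & unique_sub),
        'n_sub_missing_from_super': len(unique_sub - unique_super),
        'n_difference': len(unique_super - unique_sub),
    }


# Toy stand-ins for the S&P 1000 and S&P 400 lists (made-up tickers)
toy_1000 = ['AAA', 'BBB', 'CCC', 'DDD', 'DDD']
toy_400 = ['AAA', 'BBB']

report = diagnose_difference(toy_1000, toy_400)
print(report)
```

In the notebook, the same call on `sp1000_wikipedia` and `sp400_wikipedia` would show which entries are repeated; whether they come from the every-other-link slicing or from the page itself would still need to be checked by hand.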

## Barchart

We download the files below in CSV format from https://www.barchart.com. Note that you need to sign up (free) to get access.

In [19]:
path = ['s&p-500-index-05-04-2019.csv',
        'sp-400-index-05-04-2019.csv',
        'sp-600-index-05-04-2019.csv']

Define the files we want to store extracted data to

In [20]:
file = ['sp500_barchart.pickle',
        'sp400_barchart.pickle',
        'sp600_barchart.pickle']

The code below reads the data from the csv file, stores it to a DataFrame object, and saves the extracted information using [pickle](https://docs.python.org/3/library/pickle.html)

In [21]:
for i in range(len(path)):

    # Read data to a DataFrame
    data = pd.read_csv(path[i])
    # Exclude the last line since it does not contain a ticker
    data = data[:-1]

    # Create a list containing the tickers
    sp = data['Symbol'].tolist()

    # Save the list to a file; the with block closes it automatically
    with open(file[i], 'wb') as output:
        pickle.dump(sp, output)
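
The slice `data[:-1]` relies on the last CSV line always being a non-ticker footer. A slightly more defensive variant is sketched below with the standard-library `csv` module rather than pandas. The sample rows, the footer wording, the `Name` column, and the `looks_like_ticker` rule are all assumptions made for illustration, not the actual Barchart file layout.

```python
import csv
import io

# Hypothetical sample mimicking a downloaded file; the footer wording is assumed
SAMPLE_CSV = """Symbol,Name
AAA,Example Company A
BBB.B,Example Company B Class B
Downloaded from a data vendor as of 05-04-2019,
"""


def looks_like_ticker(symbol):
    """Assumed rule: short, upper-case letters, with an optional dot or hyphen."""
    stripped = symbol.replace('.', '').replace('-', '')
    return stripped.isalpha() and symbol.isupper() and len(symbol) <= 7


def read_symbols(handle):
    """Return the Symbol column, keeping only values that look like tickers."""
    reader = csv.DictReader(handle)
    return [row['Symbol'] for row in reader if looks_like_ticker(row['Symbol'] or '')]


toy_symbols = read_symbols(io.StringIO(SAMPLE_CSV))
print(toy_symbols)
```

Filtering by shape keeps the result stable even if a file arrives without a footer, at the cost of depending on the assumed ticker rule.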

Define the S&P constituents lists from Barchart

In [22]:
sp500_barchart = []
sp400_barchart = []
sp600_barchart = []
sp_barchart = [sp500_barchart,
               sp400_barchart,
               sp600_barchart]

Load the pickled data to the `sp_barchart`

In [23]:
for i in range(len(sp_barchart)):
    # The with block closes the file automatically, so no explicit close is needed
    with open(file[i], 'rb') as f:
        sp_barchart[i] = pickle.load(f)

Check the number of constituents; it should equal 505 according to [S&P Dow Jones Indices](https://us.spindices.com/indices/equity/sp-500)

In [24]:
sp500_barchart = sp_barchart[0]

In [25]:
len(sp500_barchart)

505

Check the number of constituents; it should equal 400 according to [S&P Dow Jones Indices](https://us.spindices.com/indices/equity/sp-400)

In [26]:
sp400_barchart = sp_barchart[1]

In [27]:
len(sp400_barchart)

400

Check the number of constituents; it should equal 601 according to [S&P Dow Jones Indices](https://us.spindices.com/indices/equity/sp-600)

In [28]:
sp600_barchart = sp_barchart[2]

In [29]:
len(sp600_barchart)

601

## Comparison between Wikipedia and Barchart

### S&P 500

Sort the lists

In [37]:
sp500_wikipedia.sort()

In [38]:
sp500_barchart.sort()

Eyeball

In [39]:
sp500_wikipedia[:10]

['A', 'AAL', 'AAP', 'AAPL', 'ABBV', 'ABC', 'ABMD', 'ABT', 'ACN', 'ADBE']

In [40]:
sp500_barchart[:10]

['A', 'AAL', 'AAP', 'AAPL', 'ABBV', 'ABC', 'ABMD', 'ABT', 'ACN', 'ADBE']

Difference between Wikipedia and Barchart

In [43]:
diff_wikipedia_barchart = list(set(sp500_wikipedia) - set(sp500_barchart))

In [44]:
diff_wikipedia_barchart

[]

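The set difference is empty, so every Wikipedia ticker also appears in the Barchart list. Since both lists contain 505 tickers, the two lists match, assuming neither contains duplicates.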
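
A one-directional set difference only shows tickers missing from the second list. To check both directions in one step, a symmetric difference can be used, as in the sketch below. The helper names and toy lists are assumptions for illustration.

```python
def lists_match(list_a, list_b):
    """Return True when both lists hold the same tickers, ignoring order."""
    return sorted(list_a) == sorted(list_b)


def symmetric_difference(list_a, list_b):
    """Return the tickers that appear in exactly one of the two lists, sorted."""
    return sorted(set(list_a) ^ set(list_b))


# Made-up tickers standing in for the two sources
toy_wikipedia = ['AAA', 'BBB', 'CCC']
toy_barchart = ['CCC', 'AAA', 'BBB']
toy_stale = ['AAA', 'BBB', 'DDD']

print(lists_match(toy_wikipedia, toy_barchart))
print(symmetric_difference(toy_wikipedia, toy_stale))
```

Comparing the sorted lists, rather than the sets, also catches a duplicate that appears in only one source.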

### S&P 400

Sort the lists

In [45]:
sp400_wikipedia.sort()

In [46]:
sp400_barchart.sort()

Eyeball

In [47]:
sp400_wikipedia[:10]

['AAN', 'ACC', 'ACHC', 'ACIW', 'ACM', 'ADNT', 'AEO', 'AFG', 'AGCO', 'AHL']

In [48]:
sp400_barchart[:10]

['AAN', 'ACC', 'ACHC', 'ACIW', 'ACM', 'ADNT', 'AEO', 'AFG', 'AGCO', 'ALE']

Difference between Wikipedia and Barchart

In [49]:
diff_wikipedia_barchart = list(set(sp400_wikipedia) - set(sp400_barchart))

In [65]:
diff_wikipedia_barchart[:10]

['QHC', 'NSP', 'MHLD', 'INGN', 'PRSP', 'PSB', 'PAY', 'FRAN', 'FTNT', 'CLD']

In [51]:
len(diff_wikipedia_barchart)

28

Difference between Barchart and Wikipedia

In [52]:
diff_barchart_wikipedia = list(set(sp400_barchart) - set(sp400_wikipedia))

In [63]:
diff_barchart_wikipedia[:10]

['XPO', 'NSP', 'YELP', 'OLED', 'INGN', 'PRSP', 'PSB', 'CFX', 'REZI', 'AMED']

In [54]:
len(diff_barchart_wikipedia)

28

The two lists differ by 28 tickers in each direction, which suggests that one of the sources is outdated: either the Wikipedia list (Barchart contains updated tickers) or the Barchart list.

### S&P 600

Sort the lists

In [55]:
sp600_wikipedia.sort()

In [56]:
sp600_barchart.sort()

Eyeball

In [57]:
sp600_wikipedia[:10]

['AAOI', 'AAON', 'AAT', 'AAWW', 'AAXN', 'ABCB', 'ABG', 'ABM', 'ACET', 'ACLS']

In [58]:
sp600_barchart[:10]

['AAOI', 'AAON', 'AAT', 'AAWW', 'AAXN', 'ABCB', 'ABG', 'ABM', 'ACA', 'ACLS']

Difference between Wikipedia and Barchart

In [59]:
diff_wikipedia_barchart = list(set(sp600_wikipedia) - set(sp600_barchart))

In [62]:
diff_wikipedia_barchart[:10]

['QHC', 'NSP', 'MHLD', 'INGN', 'PRSP', 'PSB', 'PAY', 'FRAN', 'FTNT', 'CLD']

In [61]:
len(diff_wikipedia_barchart)

51

Difference between Barchart and Wikipedia

In [66]:
diff_barchart_wikipedia = list(set(sp600_barchart) - set(sp600_wikipedia))

In [67]:
diff_barchart_wikipedia[:10]

['KLXE', 'ACA', 'DO', 'CONN', 'TNC', 'BOOM', 'GTX', 'OPI', 'NBR', 'TRHC']

In [68]:
len(diff_barchart_wikipedia)

54

The Wikipedia list contains only 598 tickers, while the Barchart list contains 601 (complete):
- 3 tickers are missing from the Wikipedia list, possibly because the S&P 600 list is deduced from the S&P 400 and S&P 1000 lists,
- there may also be a timing difference between when the constituents of the S&P 400 and S&P 1000 pages were updated, and
- 51 Wikipedia tickers are absent from Barchart and 54 Barchart tickers are absent from Wikipedia (the gap of 3 matches the 3 missing tickers), which suggests that one of the sources is outdated: either the Wikipedia list (Barchart contains updated tickers) or the Barchart list.
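
Some of the mismatches may be tickers that moved between indices rather than tickers that dropped out entirely. In the outputs above, NSP, INGN, PRSP and PSB appear both among the S&P 600 tickers found only on Wikipedia and among the S&P 400 tickers found only on Barchart, which is consistent with such a move. The sketch below lists tickers that the two sources place in different indices. The function name `moved_between_indices` and the toy data are assumptions for illustration.

```python
def moved_between_indices(source_a, source_b):
    """Return (ticker, index_in_a, index_in_b) for tickers listed under different indices.

    Each source is a dict mapping an index name to its list of tickers.
    """
    index_of_a = {t: name for name, tickers in source_a.items() for t in tickers}
    index_of_b = {t: name for name, tickers in source_b.items() for t in tickers}
    common = sorted(index_of_a.keys() & index_of_b.keys())
    return [(t, index_of_a[t], index_of_b[t])
            for t in common
            if index_of_a[t] != index_of_b[t]]


# Made-up tickers: 'CCC' sits in a different index in each source
toy_source_a = {'sp400': ['AAA', 'BBB'], 'sp600': ['CCC', 'DDD']}
toy_source_b = {'sp400': ['AAA', 'CCC'], 'sp600': ['DDD', 'EEE']}

moved = moved_between_indices(toy_source_a, toy_source_b)
print(moved)
```

Tickers that appear in only one source (such as 'BBB' and 'EEE' here) are left out, since they are already covered by the set differences.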

## Conclusion

insert conclusion