# Correlation Analysis for Top 20 Stocks in the S&P 500

## Acquiring Ticker Symbols for Analysis

To conduct a thorough analysis of the largest stocks by market capitalization in the S&P 500, we have two primary methods for obtaining the necessary ticker symbols:

1. ***Web Scraping***: This approach involves programmatically requesting a page from a financial website and extracting the list of ticker symbols. The resulting list is as current as the source site, so it reflects recent changes in market capitalization.

2. ***CSV Utilization***: Alternatively, we can use a precompiled CSV file containing the list of ticker symbols. While this is simpler and faster, the CSV must be current and accurate so that it reflects the top stocks at the time of analysis.

Choose the method that best balances data freshness against convenience for the correlation study.
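
Whichever method produces the list, it helps to clean it before use. The following is a minimal sketch, not part of the original notebook: the helper name `load_tickers` and the sample data are hypothetical, and the dot-to-dash replacement assumes the prices will come from Yahoo Finance, which writes share classes as `BRK-B` rather than `BRK.B`.

```python
import csv
import io

def load_tickers(csv_text, column="Ticker", limit=20):
    """Return up to `limit` cleaned, de-duplicated tickers from CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    seen = set()
    tickers = []
    for row in reader:
        raw = (row.get(column) or "").strip().upper()
        if not raw:
            continue  # skip rows with an empty ticker field
        # Assumption: the data source is Yahoo Finance, which uses a dash
        # for share classes (BRK-B) where many lists use a dot (BRK.B).
        symbol = raw.replace(".", "-")
        if symbol not in seen:
            seen.add(symbol)
            tickers.append(symbol)
        if len(tickers) == limit:
            break
    return tickers

# Made-up sample standing in for the real CSV file.
sample = "Ticker\nAAPL\n msft \nBRK.B\n\nAAPL\nNVDA\n"
print(load_tickers(sample, limit=3))
```

For a real file, the text can be read with `open(path).read()` and passed to the same function.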

In [1]:
import pandas as pd
import datetime
import requests
from bs4 import BeautifulSoup

# URL of the page
url = 'WEBSITE_URL' # Paste URL

# Make a request to the website and stop early on an HTTP error
response = requests.get(url, timeout=30)
response.raise_for_status()
webpage = response.text

# Parse the webpage with BeautifulSoup
soup = BeautifulSoup(webpage, 'html.parser')

# Assume the data is stored in a table with an identifiable class or id.
# This is a generic example; inspect the webpage to find the correct tag and class.
table = soup.find('table', class_='FILL_IN')  # Replace 'FILL_IN' with the table's actual class

ticker_symbols = []

if table:
    # Extract the ticker symbols from the second column of each row
    for row in table.find_all('tr'):
        cells = row.find_all('td')
        if len(cells) > 1:  # Ensure there are at least two columns
            ticker_symbol = cells[1].text.strip()  # Second column for the symbol
            ticker_symbols.append(ticker_symbol)
else:
    print("Table not found. Check the class name or table structure.")

df_tickers = pd.DataFrame(ticker_symbols, columns=['Ticker'])

today = datetime.datetime.now()

print(df_tickers)
print(today)


csv_file_path = '/Users/LOGIN/Desktop/top_50_spy_tickers.csv'  # Define the path and filename for your CSV file
df_tickers.to_csv(csv_file_path, index=False)  # Set index=False to avoid writing row indices to the CSV file

print(f"Data exported successfully to {csv_file_path}")


# Correlation Analysis for Select S&P 500 Stocks

This Jupyter Notebook contains a Python script designed to perform a correlation analysis on the top 20 stocks from the S&P 500. The analysis follows these steps:

1. **Data Acquisition**: The script starts by loading a list of ticker symbols from a CSV file located on the user's desktop. This file contains the tickers for the top 50 S&P 500 stocks by market capitalization.

2. **Ticker Selection**: From the loaded list, only the first 20 tickers are selected to ensure a manageable dataset size for this demonstration.

3. **Date Range Input**: The user is prompted to input the start and end dates for the data retrieval, allowing for a dynamic range of historical data to be analyzed.

4. **Data Fetching**: Utilizing the `yfinance` library, historical closing prices for the selected tickers are downloaded within the user-defined date range. Any tickers without data are skipped.

5. **Percentage Change Computation**: The script computes the daily percentage changes in closing prices. Working with these daily returns rather than raw price levels puts all stocks on a comparable scale, and the correlations in the next step are computed on them.

6. **Correlation Matrix Calculation**: A correlation matrix is calculated from these daily percentage changes, providing insight into the relationships between the stock price movements.

7. **Heatmap Visualization**: A heatmap is plotted using `seaborn` to visually represent the correlation matrix, employing a blue-green (`BuGn`) color gradient so that darker cells indicate stronger positive correlation.

8. **Distribution Analysis**: To analyze the distribution of correlation coefficients, the script generates a histogram plot, offering a statistical view of how often each correlation coefficient occurs.

By the end of this notebook, we'll have a clear visual and statistical understanding of how these top S&P 500 stocks correlate with one another within the specified date range.
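
To make steps 5 and 6 concrete, the sketch below reproduces in plain Python the arithmetic that `pct_change()` and `corr()` perform. It is an illustration only: the three tickers and their prices are made up, and the helper names `pct_change` and `pearson` are hypothetical stand-ins, not the pandas implementations.

```python
import math

def pct_change(prices):
    """Daily simple returns: (p_t - p_{t-1}) / p_{t-1}."""
    return [(curr - prev) / prev for prev, curr in zip(prices, prices[1:])]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Made-up closing prices for three hypothetical tickers.
prices = {
    "AAA": [100.0, 102.0, 101.0, 104.0, 103.0],
    "BBB": [50.0, 51.0, 50.5, 52.0, 51.5],   # AAA halved
    "CCC": [20.0, 19.5, 20.0, 19.0, 19.5],   # moves against AAA
}

returns = {name: pct_change(series) for name, series in prices.items()}

# Visit each unordered pair once: the correlation matrix is symmetric
# and every ticker correlates perfectly with itself.
names = sorted(returns)
pair_corr = {
    (a, b): pearson(returns[a], returns[b])
    for i, a in enumerate(names)
    for b in names[i + 1:]
}

for pair, rho in pair_corr.items():
    print(pair, round(rho, 3))
```

Because `BBB` is constructed as exactly half of `AAA`, their daily returns are identical and the coefficient comes out at 1 (up to floating-point rounding), which shows that correlation of returns does not depend on the price level. `CCC` falls on the days `AAA` rises, so that pair's coefficient is negative.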


In [None]:
import pandas as pd
import yfinance as yf
import matplotlib.pyplot as plt
import seaborn as sns

# Load tickers from CSV file
df_tickers = pd.read_csv("PATH/Desktop/top_50_spy_tickers.csv")  # Enter path

# Assuming the ticker symbols are in a column named 'Ticker'
tickers = df_tickers['Ticker'].tolist()

# Only take the first 20 tickers for analysis
tickers = tickers[:20]

# Define the period for which you want the data
start_date = input("Please enter the start date (yyyy-mm-dd): ")
end_date = input("Please enter the end date (yyyy-mm-dd): ")

# Create an empty DataFrame to store closing prices
closing_prices = pd.DataFrame()

# Loop through each ticker and fetch the data
for ticker in tickers:
    data = yf.download(ticker, start=start_date, end=end_date)
    # Ensure data contains 'Close' column and has valid data (not empty)
    if 'Close' in data and not data['Close'].empty:
        closing_prices[ticker] = data['Close']  # Extracting the Close prices
    else:
        print(f"No data for {ticker}, skipping.")

# Calculate daily percentage change
daily_change = closing_prices.pct_change()

# Calculate the correlation matrix from daily percentage changes
correlation_matrix = daily_change.corr()

# Plotting the correlation matrix using seaborn
fig, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='BuGn', cbar=True, linewidths=0.5, ax=ax)
plt.title('Correlation Matrix for SPY ETF Tickers')
fig.savefig('/Users/LOGIN/Desktop/correlation_matrix.pdf')  # Save as PDF
plt.show()

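# Flattening the correlation matrix and keeping each pair of tickers once.
# The matrix is symmetric, so comparing the two index levels drops both the
# self-correlations (diagonal elements) and the mirrored duplicate of every pair.
flat_corr = correlation_matrix.unstack()
flat_corr = flat_corr[flat_corr.index.get_level_values(0) < flat_corr.index.get_level_values(1)]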

# Plotting the distribution of correlation coefficients
fig, ax = plt.subplots(figsize=(10, 6))
sns.histplot(flat_corr, bins=30, kde=True, color='green', ax=ax)
plt.title('Distribution of Correlation Coefficients')
plt.xlabel('Correlation Coefficient')
plt.ylabel('Frequency')
fig.savefig('/Users/LOGIN/Desktop/distribution_of_correlation_coefficients.pdf')  # Save as PDF
plt.show()
