# S&P 500 Stocks Dataset


### Source: https://www.kaggle.com/datasets/andrewmvd/sp-500-stocks?select=sp500_stocks.csv

### About dataset
The Standard and Poor's 500, or S&P 500, is one of the most widely followed financial benchmarks in the world.

This stock market index tracks the performance of 500 large companies listed on stock exchanges in the United States. As of December 31, 2020, more than $5.4 trillion was invested in assets tied to the performance of this index.

Because the index includes multiple classes of stock for some constituent companies (for example, Alphabet's Class A (GOOGL) and Class C (GOOG)), there are actually 505 stocks in the gauge.

### Dataset description
The dataset comprises three subsets, each in a separate CSV file:
### 1) sp500_stocks.csv
The stocks subset contains 1,843,998 rows and 8 columns (7 once **Date** is set as the index):
- **Date**: Trading date, from 2010-01-04 to 2024-07-29
- **Symbol**: Company Symbol/Ticker
- **Adj Close**: Closing price adjusted for corporate actions such as dividends and splits
- **Close**: Price at market close
- **High**: Highest price during the trading day
- **Low**: Lowest price during the trading day
- **Open**: Price at market open
- **Volume**: Number of shares traded
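
Because **Adj Close** accounts for dividends and splits, it is usually the safer column for computing returns. The following is a minimal sketch, assuming the long format described above (one row per date and symbol). The `toy` frame and its prices are invented for illustration, and `add_daily_returns` is a hypothetical helper, not part of the dataset or the notebook.

```python
import pandas as pd

# Toy frame mimicking the layout of sp500_stocks.csv.
# Symbols and prices are invented for illustration only.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04"] * 2),
    "Symbol": ["AAA"] * 3 + ["BBB"] * 3,
    "Adj Close": [100.0, 110.0, 99.0, 50.0, 50.0, 55.0],
})


def add_daily_returns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a sorted copy of df with a per-symbol simple daily 'Return' column."""
    out = df.sort_values(["Symbol", "Date"]).copy()
    # Shift within each symbol so the first row of a symbol never
    # borrows the last price of the previous symbol.
    previous = out.groupby("Symbol")["Adj Close"].shift(1)
    out["Return"] = out["Adj Close"] / previous - 1.0
    return out


returns = add_daily_returns(toy)
print(returns)
```

The first row of each symbol has no previous price, so its return is `NaN` by construction.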

### 2) sp500_index.csv
The index subset contains 2517 rows and 2 columns:
- **Date**: Trading date, from 2014-07-28 to 2024-07-26
- **S&P500**: S&P500 index

### 3) sp500_companies.csv
The companies subset contains 503 rows and 16 columns:
- **Exchange**: The stock exchange where the company is listed.
- **Symbol**: The stock ticker symbol.
- **Shortname**: The short name of the company.
- **Longname**: The full name of the company.
- **Sector**: The sector to which the company belongs.
- **Industry**: The industry within the sector.
- **Currentprice**: The current price of the stock.
- **Marketcap**: The market capitalization of the company.
- **Ebitda**: Earnings before interest, taxes, depreciation, and amortization.
- **Revenuegrowth**: The revenue growth rate of the company.
- **City**: The city where the company is headquartered.
- **State**: The state where the company is headquartered.
- **Country**: The country where the company is headquartered.
- **Fulltimeemployees**: The number of full-time employees.
- **Longbusinesssummary**: A detailed summary of the company's business.
- **Weight**: The weight of the company in the S&P 500 index.
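
**Symbol** appears in both `sp500_stocks.csv` and `sp500_companies.csv`, so sector and industry labels can be attached to price rows with a merge. The sketch below shows one way to do that. The two toy frames are invented for illustration, and only the column names are taken from the dataset description.

```python
import pandas as pd

# Invented rows that mimic the two files; only the column names come from the dataset.
toy_stocks = pd.DataFrame({
    "Date": pd.to_datetime(["2024-01-02", "2024-01-02", "2024-01-02"]),
    "Symbol": ["AAA", "BBB", "CCC"],
    "Adj Close": [10.0, 20.0, 30.0],
})
toy_companies = pd.DataFrame({
    "Symbol": ["AAA", "BBB", "CCC"],
    "Sector": ["Technology", "Technology", "Healthcare"],
})

# validate="many_to_one" raises an error if a Symbol is duplicated in
# toy_companies, which would otherwise silently multiply price rows.
merged = toy_stocks.merge(
    toy_companies[["Symbol", "Sector"]],
    on="Symbol",
    how="left",
    validate="many_to_one",
)

# Example aggregation: mean adjusted close per sector.
sector_mean = merged.groupby("Sector")["Adj Close"].mean()
print(sector_mean)
```

A left join keeps every price row even when a symbol has no match in the companies table, so unmatched symbols show up as `NaN` in `Sector` rather than disappearing.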

In [6]:
# Import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [7]:
# 1) sp500_stocks - Load dataset
stocks_df = pd.read_csv("sp500_stocks.csv")


In [8]:
# 1) sp500_stocks - Set the date as index.
stocks_df['Date'] = pd.to_datetime(stocks_df['Date'])
stocks_df.set_index('Date', inplace=True)
stocks_df.sample(10)

| Date | Symbol | Adj Close | Close | High | Low | Open | Volume |
|---|---|---|---|---|---|---|---|
| 2022-01-24 | JCI | 69.383362 | 73.790001 | 74.110001 | 70.529999 | 71.18 | 5566400.0 |
| 2014-08-29 | WST | 41.721828 | 43.43 | 43.610001 | 43.25 | 43.41 | 273500.0 |
| 2013-11-04 | LUV | 16.15406 | 17.870001 | 18.129999 | 17.530001 | 17.530001 | 9384600.0 |
| 2011-10-27 | AFL | 18.033596 | 23.385 | 23.99 | 22.924999 | 22.975 | 17083000.0 |
| 2011-12-01 | TEL | 24.536238 | 31.280001 | 31.809999 | 31.17 | 31.48 | 3340500.0 |
| 2013-12-27 | DE | 76.713364 | 90.699997 | 91.209999 | 90.360001 | 90.889999 | 2014000.0 |
| 2018-08-08 | KVUE | NaN | NaN | NaN | NaN | NaN | NaN |
| 2019-12-16 | HLT | 107.355698 | 108.480003 | 109.300003 | 107.699997 | 108.040001 | 2516700.0 |
| 2010-09-08 | SYK | 37.948124 | 45.48 | 46.049999 | 45.290001 | 46.049999 | 2911800.0 |
| 2021-12-31 | CPB | 39.813961 | 43.459999 | 43.549999 | 43.119999 | 43.259998 | 1083900.0 |
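
The load, parse, and index steps above can also be collapsed into a single `read_csv` call. The sketch below demonstrates this on an in-memory CSV with invented values, so it runs without the Kaggle file. With the real data, the `io.StringIO` object would be replaced by the path `"sp500_stocks.csv"`.

```python
import io

import pandas as pd

# Two invented rows with the same header as sp500_stocks.csv.
csv_text = """Date,Symbol,Adj Close,Close,High,Low,Open,Volume
2024-01-02,AAA,9.5,10.0,10.5,9.8,9.9,1000
2024-01-03,AAA,9.7,10.2,10.4,10.0,10.1,1200
"""

# parse_dates converts the column to datetime64 and index_col makes it
# the index, which replaces the separate to_datetime/set_index steps.
demo_df = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["Date"],
    index_col="Date",
)
print(demo_df.index)
print(demo_df.columns.tolist())
```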


In [9]:
# 1) sp500_stocks - General info
stocks_df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1843998 entries, 2010-01-04 to 2024-07-29
Data columns (total 7 columns):
 #   Column     Dtype  
---  ------     -----  
 0   Symbol     object 
 1   Adj Close  float64
 2   Close      float64
 3   High       float64
 4   Low        float64
 5   Open       float64
 6   Volume     float64
dtypes: float64(6), object(1)
memory usage: 112.5+ MB
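
The sample above contains a KVUE row with no price data, and for a frame this large `info()` does not print non-null counts by default. One way to see where prices are missing is to count empty rows per symbol and find each symbol's first date with data. This is a sketch on invented data, with `Date` kept as an ordinary column for simplicity. The symbols `OLD` and `NEW` are hypothetical.

```python
import numpy as np
import pandas as pd

# Invented data: "NEW" has no prices on its first two dates.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04"] * 2),
    "Symbol": ["OLD"] * 3 + ["NEW"] * 3,
    "Close": [10.0, 10.5, 11.0, np.nan, np.nan, 20.0],
})

# Number of rows per symbol where Close is missing.
missing_per_symbol = (
    toy.assign(missing=toy["Close"].isna())
    .groupby("Symbol")["missing"]
    .sum()
)

# Earliest date per symbol that has an actual price.
first_valid = toy.dropna(subset=["Close"]).groupby("Symbol")["Date"].min()

print(missing_per_symbol)
print(first_valid)
```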


In [17]:
# 2) sp500_index - Load dataset 
index_df = pd.read_csv("sp500_index.csv")


In [18]:
# 2) sp500_index - Set the date as index.
index_df['Date'] = pd.to_datetime(index_df['Date'])
index_df.set_index('Date', inplace=True)
index_df.sample(10)

| Date | S&P500 |
|---|---|
| 2021-01-21 | 3853.07 |
| 2021-04-28 | 4183.18 |
| 2017-05-26 | 2415.82 |
| 2015-03-11 | 2040.24 |
| 2023-05-11 | 4130.62 |
| 2023-08-04 | 4478.03 |
| 2022-11-04 | 3770.55 |
| 2016-12-09 | 2259.53 |
| 2024-05-08 | 5187.67 |
| 2024-02-26 | 5069.53 |


In [19]:
# 2) sp500_index - General info
index_df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2517 entries, 2014-07-28 to 2024-07-26
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   S&P500  2517 non-null   float64
dtypes: float64(1)
memory usage: 39.3 KB
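
With the date as index, growth of the index over a period can be summarized as a total return and a compound annual growth rate (CAGR). The following is a sketch under two assumptions: the `S&P500` column is a price level, so dividends are not included, and a year is approximated as 365.25 days. The toy values are invented, and `total_return` and `cagr` are hypothetical helper names.

```python
import pandas as pd

# Invented index levels on three dates, shaped like index_df.
toy_index = pd.DataFrame(
    {"S&P500": [2000.0, 2100.0, 2205.0]},
    index=pd.to_datetime(["2020-01-02", "2021-01-02", "2022-01-02"]),
)
toy_index.index.name = "Date"


def total_return(series: pd.Series) -> float:
    """Simple return between the first and last observation."""
    return float(series.iloc[-1] / series.iloc[0] - 1.0)


def cagr(series: pd.Series) -> float:
    """Compound annual growth rate, using 365.25 days per year."""
    years = (series.index[-1] - series.index[0]).days / 365.25
    return float((series.iloc[-1] / series.iloc[0]) ** (1.0 / years) - 1.0)


growth = total_return(toy_index["S&P500"])
annual = cagr(toy_index["S&P500"])
print(f"total return: {growth:.4f}, CAGR: {annual:.4f}")
```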


In [23]:
# 3) sp500_companies - Load dataset 
companies_df = pd.read_csv("sp500_companies.csv")

In [24]:
# 3) sp500_companies - Display 10 random sample rows
companies_df.sample(10)

|  | Exchange | Symbol | Shortname | Longname | Sector | Industry | Currentprice | Marketcap | Ebitda | Revenuegrowth | City | State | Country | Fulltimeemployees | Longbusinesssummary | Weight |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 104 | NYQ | FI | Fiserv, Inc. | Fiserv, Inc. | Technology | Information Technology Services | 162.71 | 93676216320 | 8482000000.0 | 0.074 | Milwaukee | WI | United States | 42000.0 | Fiserv, Inc., together with its subsidiaries, ... | 0.00184 |
| 102 | NMS | GILD | Gilead Sciences, Inc. | Gilead Sciences, Inc. | Healthcare | Drug Manufacturers - General | 77.73 | 96839925760 | 12665000000.0 | 0.053 | Foster City | CA | United States | 18000.0 | Gilead Sciences, Inc., a biopharmaceutical com... | 0.001903 |
| 95 | NMS | PANW | Palo Alto Networks, Inc. | Palo Alto Networks, Inc. | Technology | Software - Infrastructure | 322.06 | 104283029504 | 1076500000.0 | 0.153 | Santa Clara | CA | United States | 15166.0 | Palo Alto Networks, Inc. provides cybersecurit... | 0.002049 |
| 291 | NMS | WTW | Willis Towers Watson Public Lim | Willis Towers Watson Public Limited Company | Financial Services | Insurance Brokers | 278.93 | 28327014400 | 2527000000.0 | 0.049 | London | NaN | United Kingdom | 48000.0 | Willis Towers Watson Public Limited Company op... | 0.000557 |
| 157 | NYQ | TFC | Truist Financial Corporation | Truist Financial Corporation | Financial Services | Banks - Regional | 44.46 | 59497263104 | NaN | NaN | Charlotte | NC | United States | 41368.0 | Truist Financial Corporation, a financial serv... | 0.001169 |
| 7 | NYQ | BRK-B | Berkshire Hathaway Inc. New | Berkshire Hathaway Inc. | Financial Services | Insurance - Diversified | 438.31 | 945031479296 | 107046000000.0 | 0.052 | Omaha | NE | United States | 396500.0 | Berkshire Hathaway Inc., through its subsidiar... | 0.018567 |
| 211 | NMS | ODFL | Old Dominion Freight Line, Inc. | Old Dominion Freight Line, Inc. | Industrials | Trucking | 201.5 | 43782926336 | 2011926000.0 | 0.061 | Thomasville | NC | United States | 22796.0 | Old Dominion Freight Line, Inc. operates as a ... | 0.00086 |
| 325 | NMS | FSLR | First Solar, Inc. | First Solar, Inc. | Technology | Solar | 220.4 | 23591835648 | 1439703000.0 | 0.448 | Tempe | AZ | United States | 6700.0 | First Solar, Inc., a solar technology company,... | 0.000464 |
| 138 | NYQ | TDG | Transdigm Group Incorporated | TransDigm Group Incorporated | Industrials | Aerospace & Defense | 1236.12 | 69171421184 | 3604000000.0 | 0.205 | Cleveland | OH | United States | 15500.0 | TransDigm Group Incorporated designs, produces... | 0.001359 |
| 93 | NMS | KLAC | KLA Corporation | KLA Corporation | Technology | Semiconductor Equipment & Materials | 778.54 | 104822620160 | 4036656000.0 | 0.091 | Milpitas | CA | United States | NaN | KLA Corporation designs, manufactures, and mar... | 0.002059 |


In [25]:
# 3) sp500_companies - General info
companies_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 503 entries, 0 to 502
Data columns (total 16 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Exchange             503 non-null    object 
 1   Symbol               503 non-null    object 
 2   Shortname            503 non-null    object 
 3   Longname             503 non-null    object 
 4   Sector               503 non-null    object 
 5   Industry             503 non-null    object 
 6   Currentprice         503 non-null    float64
 7   Marketcap            503 non-null    int64  
 8   Ebitda               474 non-null    float64
 9   Revenuegrowth        501 non-null    float64
 10  City                 503 non-null    object 
 11  State                483 non-null    object 
 12  Country              503 non-null    object 
 13  Fulltimeemployees    499 non-null    float64
 14  Longbusinesssummary  503 non-null    object 
 15  Weight               503 non-null    float64
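
In the sample rows above, **Weight** is consistent with **Marketcap** divided by a common total of roughly 5.09e13 (for example, BRK-B and FI give nearly the same total). Assuming that relationship holds for the whole file, the weights should sum to about 1 and can be aggregated by sector. This sketch uses invented companies and market caps and recomputes the weight column itself.

```python
import pandas as pd

# Invented companies; Marketcap values are illustrative only.
toy_companies = pd.DataFrame({
    "Symbol": ["AAA", "BBB", "CCC", "DDD"],
    "Sector": ["Technology", "Technology", "Healthcare", "Industrials"],
    "Marketcap": [400, 300, 200, 100],
})

# Assumed definition: each company's share of total market capitalization.
total_cap = toy_companies["Marketcap"].sum()
toy_companies["Weight"] = toy_companies["Marketcap"] / total_cap

# Sector weights, largest first.
sector_weight = (
    toy_companies.groupby("Sector")["Weight"]
    .sum()
    .sort_values(ascending=False)
)
print(sector_weight)
```

On the real file, comparing `companies_df["Weight"].sum()` with 1.0 is a quick check on whether this assumed definition matches the published column.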

### Initial Research Questions
1. What are the trends in stock prices and trading volumes for S&P 500 companies since 2010, the start of the stock data?
2. What are the historical performance and growth trends of the S&P 500 index as a whole?
3. How do specific industries within the S&P 500 perform compared to the overall index?
4. Which sectors have shown the most consistent growth since 2010?
5. How does the market capitalization and revenue growth of S&P 500 companies correlate with long-term stock performance?
6. What are the top-performing companies in terms of revenue growth and market capitalization?
7. What are the historical volatility and risk profiles of top-performing companies?
8. Which companies have the strongest balance sheets and financial health indicators?
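
For the volatility question, one common measure is the standard deviation of daily returns scaled to a yearly figure. This is a minimal sketch, assuming simple daily returns and the conventional 252 trading days per year. Neither choice comes from the dataset, `annualized_volatility` is a hypothetical helper, and both price series are invented.

```python
import numpy as np
import pandas as pd

# Common convention for US equities; an assumption, not taken from the dataset.
TRADING_DAYS_PER_YEAR = 252


def annualized_volatility(
    prices: pd.Series, periods_per_year: int = TRADING_DAYS_PER_YEAR
) -> float:
    """Sample standard deviation of simple returns, scaled by sqrt(periods)."""
    returns = (prices / prices.shift(1) - 1.0).dropna()
    return float(returns.std(ddof=1) * np.sqrt(periods_per_year))


# Invented series: one grows 1% every day, the other swings +10% then -10%.
steady = pd.Series([100.0 * 1.01 ** k for k in range(6)])
choppy = pd.Series([100.0, 110.0, 99.0])

steady_vol = annualized_volatility(steady)
choppy_vol = annualized_volatility(choppy)
print(steady_vol, choppy_vol)
```

The steady series has essentially zero volatility despite rising every day, which illustrates that this measure captures variability of returns rather than their direction.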


**Objective**: Predict stock prices using historical data.

**Techniques**: Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM), Time Series Analysis

**Skills**: Deep learning, time series forecasting, model evaluation (MSE, RMSE)
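
Before any RNN or LSTM is trained, the price series has to be cut into fixed-length input windows, split chronologically, and scored against a baseline. The sketch below covers only that scaffolding and is not itself a neural network. The lookback of 3, the 80/20 split, and the "last value" persistence baseline are illustrative assumptions, `make_windows` and `rmse` are hypothetical helpers, and the price series is a toy sequence.

```python
import numpy as np


def make_windows(series, lookback):
    """Build X[i] = series[i:i+lookback] and y[i] = series[i+lookback]."""
    values = np.asarray(series, dtype=float)
    n_samples = len(values) - lookback
    if n_samples <= 0:
        raise ValueError("series must be longer than lookback")
    X = np.stack([values[i:i + lookback] for i in range(n_samples)])
    y = values[lookback:]
    return X, y


def rmse(y_true, y_pred):
    """Root mean squared error; MSE is the same quantity before the square root."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


# Toy "prices" 1.0, 2.0, ..., 10.0 (invented).
prices = np.arange(1.0, 11.0)
X, y = make_windows(prices, lookback=3)

# Chronological split: no shuffling, so the test set lies strictly in the future.
split = int(len(X) * 0.8)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Persistence baseline: predict that the next value equals the last observed one.
baseline_pred = X_test[:, -1]
baseline_rmse = rmse(y_test, baseline_pred)
print(X.shape, baseline_rmse)
```

Recurrent layers in common deep learning libraries typically expect a trailing feature axis, which `X[..., np.newaxis]` would add here. A trained model is only worth keeping if its test RMSE beats `baseline_rmse`.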