# S&P 500 Stocks Data Analysis

This notebook analyzes the S&P 500 stocks dataset: it visualizes the data's main characteristics, checks for class imbalance (uneven numbers of observations per stock), and looks for anomalies.

In [1]:
# !pip install pandas matplotlib seaborn

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

%matplotlib inline
plt.style.use('ggplot')
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

## 1. Load and Inspect the Data

In [3]:
# Load the dataset
df = pd.read_csv('../data/all_stocks_5yr.csv')

# Display basic information about the dataset
print(df.info())
print("\nFirst few rows:")
print(df.head())
print("\nBasic statistics:")
print(df.describe())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 619040 entries, 0 to 619039
Data columns (total 7 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   date    619040 non-null  object 
 1   open    619029 non-null  float64
 2   high    619032 non-null  float64
 3   low     619032 non-null  float64
 4   close   619040 non-null  float64
 5   volume  619040 non-null  int64  
 6   Name    619040 non-null  object 
dtypes: float64(4), int64(1), object(2)
memory usage: 33.1+ MB
None

First few rows:
         date   open   high    low  close    volume Name
0  2013-02-08  15.07  15.12  14.63  14.75   8407500  AAL
1  2013-02-11  14.89  15.01  14.26  14.46   8882000  AAL
2  2013-02-12  14.45  14.51  14.10  14.27   8126000  AAL
3  2013-02-13  14.30  14.94  14.25  14.66  10259500  AAL
4  2013-02-14  14.94  14.96  13.16  13.99  31879900  AAL

Basic statistics:
                open           high            low          close  \
count  619029.000000  619032.0000

## 2. Check for Missing Values

In [4]:
# Check for missing values
missing_values = df.isnull().sum()
print("Missing values:")
print(missing_values)
print("\nOriginal dataset shape:", df.shape)

Missing values:
date       0
open      11
high       8
low        8
close      0
volume     0
Name       0
dtype: int64

Original dataset shape: (619040, 7)


In [5]:
# Remove rows with missing values
df_cleaned = df.dropna()
print("\nCleaned dataset shape:", df_cleaned.shape)


Cleaned dataset shape: (619029, 7)
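The drop from 619040 to 619029 rows removes 11 rows, which equals the number of missing `open` values. This suggests that every row missing `high` or `low` is also missing `open`. A minimal sketch of how that could be checked is shown below; the `toy` frame and its values are made up for illustration and stand in for `df`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df; the values are invented for illustration.
toy = pd.DataFrame({
    "open":  [1.0, np.nan, np.nan, 4.0],
    "high":  [1.5, np.nan, 3.5, 4.5],
    "low":   [0.5, np.nan, 2.5, 3.5],
    "close": [1.2, 2.2, 3.2, 4.2],
})

# Rows with at least one missing value: these are the rows dropna() removes.
rows_with_nan = toy.isnull().any(axis=1)
n_dropped = len(toy) - len(toy.dropna())

# True when every row missing 'high' or 'low' is also missing 'open'.
missing_high_low = toy["high"].isnull() | toy["low"].isnull()
missing_open = toy["open"].isnull()
nested = bool((~missing_high_low | missing_open).all())

print(f"Rows with any NaN: {int(rows_with_nan.sum())}, rows dropped: {n_dropped}")
print(f"Missing high/low rows are a subset of missing open rows: {nested}")
```

Running the same three checks on `df` would show whether the missing values really overlap as the counts imply.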


In [6]:
# Converts the 'date' column to datetime objects.
# Uses errors='coerce' to handle invalid parsing; invalid dates become NaT.
df['date'] = pd.to_datetime(df['date'], errors='coerce')

In [7]:
# Drops rows where 'date' conversion failed.
# Ensures that all entries have valid dates.
df = df.dropna(subset=['date'])
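Because `errors='coerce'` silently turns unparseable values into `NaT`, it can be useful to count how many values failed to parse before dropping them. The sketch below uses a made-up `raw` series (not taken from the dataset) with one deliberately invalid entry.

```python
import pandas as pd

# Made-up sample; the second entry is deliberately not a valid date.
raw = pd.Series(["2013-02-08", "not a date", "2013-02-12"])

# Invalid entries become NaT instead of raising an error.
parsed = pd.to_datetime(raw, errors="coerce")

# Entries that were present in the input but became NaT after parsing.
n_failed = int((parsed.isna() & raw.notna()).sum())
print(f"Unparseable dates: {n_failed}")
```

A count of zero on the real `date` column would mean the subsequent `dropna(subset=['date'])` removes nothing.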

In [8]:
# Eliminates any duplicate rows to prevent skewed calculations.
df = df.drop_duplicates()
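`drop_duplicates()` without arguments only removes rows that are identical in every column. For daily price data, a stricter check is whether any stock has more than one row for the same date. A minimal sketch, assuming `Name` and `date` together should identify a row uniquely (the `toy` data is invented for illustration):

```python
import pandas as pd

# Toy frame: two rows share the same (Name, date) but differ in 'close'.
toy = pd.DataFrame({
    "date": ["2013-02-08", "2013-02-08", "2013-02-11"],
    "Name": ["AAL", "AAL", "AAL"],
    "close": [14.75, 14.80, 14.46],
})

# Exact duplicates across all columns.
full_dupes = int(toy.duplicated().sum())

# Duplicates on the assumed key (Name, date), regardless of other columns.
key_dupes = int(toy.duplicated(subset=["Name", "date"]).sum())

print(f"Fully identical rows: {full_dupes}")
print(f"Repeated (Name, date) pairs: {key_dupes}")
```

If the key-based count is nonzero on the real data, those rows need a decision about which record to keep rather than a blanket removal.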

In [9]:
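# Rebuild df_cleaned from df so the saved file also reflects the date
# conversion and duplicate removal applied in cells 6-8, then save it.
df_cleaned = df.dropna()
output_file = "../data/clean_stocks_data.csv"
df_cleaned.to_csv(output_file, index=False)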
print(f"\nCleaned data saved to {output_file}")


Cleaned data saved to ../data/clean_stocks_data.csv


## 3. Analyze Class Distribution (Stock Names)

In [10]:
# Count the number of data points for each stock
stock_counts = df['Name'].value_counts()

# Print some statistics about the distribution
print(f"Number of unique stocks: {len(stock_counts)}")
print(f"Average number of data points per stock: {stock_counts.mean():.2f}")
print(f"Minimum number of data points: {stock_counts.min()}")
print(f"Maximum number of data points: {stock_counts.max()}")

Number of unique stocks: 505
Average number of data points per stock: 1225.82
Minimum number of data points: 44
Maximum number of data points: 1259
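The gap between the minimum (44) and maximum (1259) counts indicates that some stocks have a much shorter history than others. One way to list them is to compare each count against the maximum. The sketch below uses hypothetical tickers (`AAA`, `BBB`, `CCC`) in place of the real `Name` column.

```python
import pandas as pd

# Hypothetical tickers; 'CCC' has fewer rows than the others.
toy = pd.DataFrame({"Name": ["AAA"] * 4 + ["BBB"] * 4 + ["CCC"] * 1})

stock_counts = toy["Name"].value_counts()

# Treat the maximum count as a full history and flag anything shorter.
full_length = stock_counts.max()
short_history = stock_counts[stock_counts < full_length]

# Share of stocks that do not cover the full period.
share_short = len(short_history) / len(stock_counts)

print(short_history)
print(f"Stocks with incomplete history: {len(short_history)} ({share_short:.1%})")
```

Applied to the real `stock_counts`, this would show how many of the 505 stocks fall short of 1259 rows and by how much.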
