# Data Processing and Exploration #

This notebook preprocesses, cleans, and explores the data shared by the bots and models in the ensemble AI system. The primary objective is to transform the data into one consistent format that every model can consume. The processed data serves as the base dataset for the machine learning algorithms and strategies developed in the project.

## Objectives:

- **Data Preprocessing**: Handle missing values, normalize data, and perform necessary transformations to optimize model performance.
- **Data Exploration**: Perform exploratory data analysis (EDA) to understand key trends, distributions, correlations, and patterns in the data.
- **Shared Data Pipeline**: Centralize data preparation so that all models and bots in the system receive uniform, well-structured input data.

The notebook focuses on repeatable, scalable data-handling techniques that integrate with the AI pipeline for further analysis, training, and evaluation.
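The preprocessing objectives above can be sketched as a single reusable function. This is a minimal sketch, not the project's actual pipeline: the names `preprocess`, `PRICE_COLS`, and `Close_z` are hypothetical, the sample rows use a subset of the columns from the combined file, and one `Close` value is deliberately set to `NaN` to exercise the gap-filling step.

```python
import numpy as np
import pandas as pd

# Assumed column names, taken from the combined S&P 500 file shown in this notebook
PRICE_COLS = ["Open", "High", "Low", "Close", "Adj Close"]


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: parse dates, fill gaps, and z-score Close per ticker."""
    out = df.copy()
    out["Date"] = pd.to_datetime(out["Date"])
    out = out.sort_values(["Ticker", "Date"]).reset_index(drop=True)

    # Forward-fill missing values within each ticker only, so one
    # ticker's prices never leak into another's. A gap on a ticker's
    # first row has nothing to fill from and stays NaN.
    cols = [c for c in PRICE_COLS + ["Volume"] if c in out.columns]
    out[cols] = out.groupby("Ticker")[cols].ffill()

    # Normalize Close per ticker (z-score) so tickers with different
    # price levels are comparable.
    grouped = out.groupby("Ticker")["Close"]
    out["Close_z"] = (out["Close"] - grouped.transform("mean")) / grouped.transform("std")
    return out


# Illustrative input: rows are out of order and one Close is missing on purpose
sample = pd.DataFrame({
    "Date": ["2023-01-04", "2023-01-03", "2023-01-05"],
    "Close": [np.nan, 102.399666, 102.809364],
    "Volume": [3312561, 3124909, 3117494],
    "Ticker": ["MMM", "MMM", "MMM"],
})
clean = preprocess(sample)
```

Note that a z-score computed over the full history uses future prices; if the normalized column feeds a model that is evaluated on later dates, a rolling or training-window statistic may be the safer choice.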


In [2]:
import random

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sb

# Render matplotlib figures inline in the notebook
%matplotlib inline

In [3]:
sp500 = pd.read_csv('combined_sp500_data.csv')

In [7]:
sp500.head()

| Unnamed: 0 | Date | Open | High | Low | Close | Adj Close | Volume | Ticker |
|---|---|---|---|---|---|---|---|---|
| 0 | 2023-01-03 | 101.605354 | 102.541809 | 100.643814 | 102.399666 | 93.82341 | 3124909 | MMM |
| 1 | 2023-01-04 | 103.135452 | 104.757523 | 102.600334 | 104.640465 | 95.876534 | 3312561 | MMM |
| 2 | 2023-01-05 | 103.854515 | 104.155518 | 102.391304 | 102.809364 | 94.198799 | 3117494 | MMM |
| 3 | 2023-01-06 | 104.230766 | 106.29599 | 103.469902 | 105.953178 | 97.079315 | 2890732 | MMM |
| 4 | 2023-01-09 | 106.187294 | 108.244148 | 105.443146 | 106.011703 | 97.132935 | 3434075 | MMM |
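The `Unnamed: 0` column in the output above is what pandas typically produces when a CSV was written with its index and then read back without `index_col`. The sketch below shows one way to avoid it and to parse `Date` as a datetime at load time. It reads from an in-memory string rather than the real file, and the blank first header field is an assumption about how `combined_sp500_data.csv` was written.

```python
import io

import pandas as pd

# Two rows copied from the output above. The empty first header field
# is an assumption about the file layout, not something the notebook confirms.
csv_text = """,Date,Open,High,Low,Close,Adj Close,Volume,Ticker
0,2023-01-03,101.605354,102.541809,100.643814,102.399666,93.82341,3124909,MMM
1,2023-01-04,103.135452,104.757523,102.600334,104.640465,95.876534,3312561,MMM
"""

# Default read: the unnamed first column is kept as data
raw = pd.read_csv(io.StringIO(csv_text))

# Treat the first column as the index and parse Date while reading
tidy = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=["Date"])

print(raw.columns[0])
print(tidy.dtypes)
```

If the shared file is regenerated, writing it with `to_csv(..., index=False)` would keep the extra column out of the file in the first place.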


In [8]:
appl_data = pd.read_csv('sp500_stocks_indv/AAPL_data.csv')

In [10]:
appl_data.head()

| Unnamed: 0 | Date | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|---|
| 0 | 2023-01-03 | 130.279999 | 130.899994 | 124.169998 | 125.07 | 123.904633 | 112117500 |
| 1 | 2023-01-04 | 126.889999 | 128.660004 | 125.080002 | 126.360001 | 125.182617 | 89113600 |
| 2 | 2023-01-05 | 127.129997 | 127.769997 | 124.760002 | 125.019997 | 123.855095 | 80962700 |
| 3 | 2023-01-06 | 126.010002 | 130.289993 | 124.889999 | 129.619995 | 128.412216 | 87754700 |
| 4 | 2023-01-09 | 130.470001 | 133.410004 | 129.889999 | 130.149994 | 128.937302 | 70790800 |
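Unlike the combined file, the per-ticker file has no `Ticker` column, so it does not match the shared schema as loaded. One possible way to align it is to derive the ticker from the file name, as sketched below. The helpers `ticker_from_path` and `tag_with_ticker` are hypothetical, and the `<TICKER>_data.csv` naming pattern is assumed from the single path used above.

```python
from pathlib import Path

import pandas as pd


def ticker_from_path(path: str) -> str:
    """Hypothetical helper: 'sp500_stocks_indv/AAPL_data.csv' -> 'AAPL'."""
    # Path.stem drops the directory and the '.csv' extension;
    # rsplit removes the trailing '_data' suffix.
    return Path(path).stem.rsplit("_data", 1)[0]


def tag_with_ticker(frame: pd.DataFrame, path: str) -> pd.DataFrame:
    """Return a copy of the frame with a Ticker column derived from the path."""
    return frame.assign(Ticker=ticker_from_path(path))


# Two rows and a subset of columns copied from the output above
aapl_rows = pd.DataFrame({
    "Date": ["2023-01-03", "2023-01-04"],
    "Close": [125.07, 126.360001],
    "Volume": [112117500, 89113600],
})
tagged = tag_with_ticker(aapl_rows, "sp500_stocks_indv/AAPL_data.csv")
print(tagged)
```

Frames tagged this way can then be stacked with `pd.concat` into the same shape as the combined file.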


In [11]:
sp500_extras = pd.read_csv('csp500_data_with_financials.csv')

In [12]:
sp500_extras.head()

| Unnamed: 0 | Date | Open | High | Low | Close | Adj Close | Volume | Ticker | Price_to_Book | Beta | Market_Cap | Forward_PE | Dividend_Yield |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023-01-03 | 101.605354 | 102.541809 | 100.643814 | 102.399666 | 93.82341 | 3124909 | MMM | 19.250843 | 0.995 | 75382358016 | 17.391636 | 0.0206 |
| 1 | 2023-01-04 | 103.135452 | 104.757523 | 102.600334 | 104.640465 | 95.876534 | 3312561 | MMM | 19.250843 | 0.995 | 75382358016 | 17.391636 | 0.0206 |
| 2 | 2023-01-05 | 103.854515 | 104.155518 | 102.391304 | 102.809364 | 94.198799 | 3117494 | MMM | 19.250843 | 0.995 | 75382358016 | 17.391636 | 0.0206 |
| 3 | 2023-01-06 | 104.230766 | 106.29599 | 103.469902 | 105.953178 | 97.079315 | 2890732 | MMM | 19.250843 | 0.995 | 75382358016 | 17.391636 | 0.0206 |
| 4 | 2023-01-09 | 106.187294 | 108.244148 | 105.443146 | 106.011703 | 97.132927 | 3434075 | MMM | 19.250843 | 0.995 | 75382358016 | 17.391636 | 0.0206 |
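In the five rows shown, the fundamentals (`Price_to_Book`, `Beta`, `Market_Cap`, `Forward_PE`, `Dividend_Yield`) are identical on every date. A quick exploratory check, sketched below, tests whether that holds per ticker. The function name `fundamentals_are_static` is hypothetical, and the sample frame reuses three MMM rows from the output above instead of reading the file.

```python
import pandas as pd

FUNDAMENTAL_COLS = [
    "Price_to_Book", "Beta", "Market_Cap", "Forward_PE", "Dividend_Yield",
]


def fundamentals_are_static(df: pd.DataFrame) -> pd.Series:
    """Hypothetical check: True per ticker when no fundamental column varies."""
    # Count distinct non-null values of each fundamental within each ticker
    distinct = df.groupby("Ticker")[FUNDAMENTAL_COLS].nunique()
    # A ticker is "static" when every fundamental has at most one value
    return (distinct <= 1).all(axis=1)


# Three MMM rows with the fundamentals copied from the output above
rows = pd.DataFrame({
    "Date": ["2023-01-03", "2023-01-04", "2023-01-05"],
    "Ticker": ["MMM"] * 3,
    "Price_to_Book": [19.250843] * 3,
    "Beta": [0.995] * 3,
    "Market_Cap": [75382358016] * 3,
    "Forward_PE": [17.391636] * 3,
    "Dividend_Yield": [0.0206] * 3,
})
static = fundamentals_are_static(rows)
print(static)
```

If the check holds across the full file, the fundamentals are probably a single snapshot rather than point-in-time values, which is worth knowing before they are used as features alongside dated prices.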
