# Data Exploration Notebook

## Introduction
In this notebook, we explore the historical market data: we inspect its structure, visualize the distribution of the target variable, and look for correlations and trends over time.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set visualization style
sns.set_theme(style="whitegrid")

## Load Data

In [2]:
# Load the historical data
data = pd.read_csv('../data/historical_data.csv')

# Display the first few rows of the dataset
data.head()

         date  feature1  feature2  target
0  2021-01-01       1.0       2.0      10
1  2021-01-02       1.5       2.5      15
2  2021-01-03       2.0       3.0      20
3  2021-01-04       2.5       3.5      25
4  2021-01-05       3.0       4.0      30
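
The `date` column is loaded as plain strings here and is only converted to datetimes later, in the time series cell. As an optional alternative, `pd.read_csv` can parse the column while loading through its `parse_dates` parameter. The sketch below is a minimal illustration, assuming every date in the file follows the ISO format of the rows shown above. It reads an in-memory copy of those five rows rather than the real file, and the names `csv_text` and `sample` are hypothetical.

```python
import io

import pandas as pd

# Stand-in for '../data/historical_data.csv': only the five rows shown above.
csv_text = """date,feature1,feature2,target
2021-01-01,1.0,2.0,10
2021-01-02,1.5,2.5,15
2021-01-03,2.0,3.0,20
2021-01-04,2.5,3.5,25
2021-01-05,3.0,4.0,30
"""

# parse_dates converts the listed columns to datetime64 during loading.
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

print(sample.dtypes)
```

With the real file, the same parameter would be passed alongside the path used in cell 2.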

## Data Overview

In [3]:
# Get basic information about the dataset
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   date      5 non-null      object 
 1   feature1  5 non-null      float64
 2   feature2  5 non-null      float64
 3   target    5 non-null      int64 
dtypes: float64(2), int64(1), object(1)
memory usage: 168.0+ bytes

In [4]:
# Check for missing values
missing_values = data.isnull().sum()
print("Missing values in each column:")
print(missing_values)

Missing values in each column:
date        0
feature1    0
feature2    0
target      0
dtype: int64
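
Because the dataset has no gaps, the check above reports only zeros. To show what the same check looks like when a value is missing, the sketch below rebuilds the five displayed rows by hand and blanks out one cell on a copy. The blanked cell is artificial and not part of the real data; `sample`, `with_gap`, and `rows_with_gaps` are hypothetical names.

```python
import numpy as np
import pandas as pd

# The five rows displayed by data.head(), rebuilt by hand for illustration.
sample = pd.DataFrame({
    "date": ["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04", "2021-01-05"],
    "feature1": [1.0, 1.5, 2.0, 2.5, 3.0],
    "feature2": [2.0, 2.5, 3.0, 3.5, 4.0],
    "target": [10, 15, 20, 25, 30],
})

# Artificially blank out one value on a copy so the check has something to find.
with_gap = sample.copy()
with_gap.loc[2, "feature1"] = np.nan

# Per-column count of missing values, as in cell 4.
missing_counts = with_gap.isnull().sum()
print(missing_counts)

# Rows that contain at least one missing value.
rows_with_gaps = with_gap[with_gap.isnull().any(axis=1)]
print(rows_with_gaps)
```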

## Data Visualization
### Distribution of Target Variable

In [5]:
plt.figure(figsize=(10, 6))
sns.histplot(data['target'], bins=30, kde=True)
plt.title('Distribution of Target Variable')
plt.xlabel('Target Variable')
plt.ylabel('Frequency')
plt.show()

[Figure: histogram of the target variable with KDE overlay, titled 'Distribution of Target Variable']
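
With `bins=30` and only five rows, most bins in this histogram are necessarily empty. One option is to let NumPy choose the bin edges from the data. The sketch below uses `np.histogram_bin_edges` with the `"auto"` rule on the five target values shown earlier; `sns.histplot` also accepts a rule name such as `bins="auto"` in place of a number. The variable names are hypothetical, and whether the automatic rule suits the full dataset is an assumption to check against the real data.

```python
import numpy as np

# The five target values shown in the data.head() output.
target_values = np.array([10, 15, 20, 25, 30])

# Let NumPy pick the bin edges with its "auto" rule instead of fixing 30 bins.
edges = np.histogram_bin_edges(target_values, bins="auto")

# Count how many observations fall into each bin.
counts, _ = np.histogram(target_values, bins=edges)

print("number of bins:", len(edges) - 1)
print("counts per bin:", counts)
```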

### Correlation Heatmap

In [6]:
plt.figure(figsize=(12, 8))
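# Restrict to numeric columns: 'date' is still a string column at this point,
# and recent pandas versions raise an error if it is passed to corr()
correlation_matrix = data.select_dtypes(include='number').corr()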
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap='coolwarm', square=True)
plt.title('Correlation Heatmap')
plt.show()

[Figure: annotated correlation heatmap of the numeric columns, titled 'Correlation Heatmap']
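
The numbers behind the heatmap can also be inspected directly. The sketch below computes the Pearson correlation matrix for the five rows shown earlier, keeping only the numeric columns so that the string-typed `date` column is left out. In those five rows each column is an exact linear function of the others, so every coefficient comes out as 1; the full dataset may look different. `sample` and `corr` are hypothetical names.

```python
import numpy as np
import pandas as pd

# The five rows displayed by data.head(), rebuilt by hand for illustration.
sample = pd.DataFrame({
    "date": ["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04", "2021-01-05"],
    "feature1": [1.0, 1.5, 2.0, 2.5, 3.0],
    "feature2": [2.0, 2.5, 3.0, 3.5, 4.0],
    "target": [10, 15, 20, 25, 30],
})

# Keep only numeric columns, then compute pairwise Pearson correlations.
corr = sample.select_dtypes(include="number").corr()
print(corr.round(2))

# Correlation of each feature with the target, strongest first.
target_corr = corr["target"].drop("target").sort_values(ascending=False)
print(target_corr)
```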

### Time Series Plot

In [7]:
plt.figure(figsize=(14, 7))
data['date'] = pd.to_datetime(data['date'])  # Ensure 'date' is in datetime format
plt.plot(data['date'], data['target'], label='Target Variable', color='blue')
plt.title('Target Variable Over Time')
plt.xlabel('Date')
plt.ylabel('Target Variable')
plt.legend()
plt.show()

[Figure: line plot of the target variable against date, titled 'Target Variable Over Time']
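
A trend in this plot can be made easier to read by smoothing the series or by looking at day-to-day changes. The sketch below indexes the five rows shown earlier by date, then computes a rolling mean and the first difference. The 3-day window is an arbitrary choice for illustration, not a recommendation, and `ts`, `rolling_mean`, and `daily_change` are hypothetical names.

```python
import pandas as pd

# The date and target columns from the five rows shown by data.head().
sample = pd.DataFrame({
    "date": pd.to_datetime(
        ["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04", "2021-01-05"]
    ),
    "target": [10, 15, 20, 25, 30],
})

# Index the target by date so it behaves as a time series.
ts = sample.set_index("date")["target"]

# Rolling mean over 3 observations; the first two entries are NaN
# because the window is not yet full.
rolling_mean = ts.rolling(window=3).mean()

# Change from one day to the next; the first entry is NaN.
daily_change = ts.diff()

print(pd.DataFrame({"target": ts, "rolling_mean": rolling_mean, "daily_change": daily_change}))
```

In these five rows the target rises by the same amount each day, so the differences are constant and the rolling mean follows a straight line.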

## Conclusion
In this notebook, we inspected the structure of the historical market data, confirmed that it contains no missing values, visualized the distribution of the target variable, examined correlations between the numeric columns, and plotted the target over time. These findings will guide feature selection and model training in the next notebook.