# Exploratory Data Analysis

In this notebook, we will perform exploratory data analysis (EDA) on the dataset. EDA is crucial for understanding the data, identifying patterns, and uncovering insights that can inform model development.

## Steps to Follow:
1. Load the dataset
2. Inspect the data
3. Visualize distributions of features
4. Analyze correlations between features
5. Identify missing values
6. Generate summary statistics

## Load Libraries
Let's start by loading the necessary libraries.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set visualization style (set_theme is the current name; sns.set is an alias for it)
sns.set_theme(style='whitegrid')

## Load the Dataset
Now, we will load the dataset from the processed data directory.

In [None]:
data_path = '../data/processed/your_processed_data.csv'
df = pd.read_csv(data_path)
df.head()
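
If the path is wrong, `pd.read_csv` raises an error that can be hard to read when relative paths are involved. The sketch below is one way to fail early with the resolved path in the message. `load_csv_checked` is a hypothetical helper name, and the demo writes a throwaway CSV to a temporary directory rather than touching the real dataset.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_csv_checked(path):
    """Load a CSV, raising a clear error that shows the resolved path if it is missing."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"No CSV found at {path.resolve()}")
    return pd.read_csv(path)


# Demo on a temporary file (illustrative data only, not the notebook's dataset)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.csv"
    pd.DataFrame({"a": [1, 2], "b": ["x", "y"]}).to_csv(demo_path, index=False)
    demo = load_csv_checked(demo_path)

print(demo)
```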

## Inspect the Data
Let's check the shape and data types of the dataset.

In [None]:
print(f'Dataset Shape: {df.shape}')
# df.info() prints its report directly and returns None, so it is not wrapped in print()
df.info()
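
Beyond shape and dtypes, a per-column overview of non-null and unique counts can help spot identifier-like or near-constant columns. A minimal sketch, assuming a small toy DataFrame in place of the real data; `column_overview` and the column names are made up for illustration.

```python
import pandas as pd


def column_overview(frame):
    """Return dtype, non-null count, and unique-value count for each column."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "non_null": frame.notna().sum(),
        "n_unique": frame.nunique(),  # nunique ignores missing values by default
    })


# Toy data for illustration only
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "group": ["a", "b", "a", None],
})
overview = column_overview(toy)
print(overview)
```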

## Visualize Distributions of Features
We will visualize the distributions of the numerical features in the dataset.

In [None]:
# include='number' also catches int32, float32, and other numeric dtypes
numerical_features = df.select_dtypes(include='number').columns
for feature in numerical_features:
    plt.figure(figsize=(10, 5))
    sns.histplot(df[feature], bins=30, kde=True)
    plt.title(f'Distribution of {feature}')
    plt.show()
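
As a numeric complement to the histograms, skewness can flag features with long tails that may be easier to read on a log scale. A minimal sketch, assuming a toy DataFrame rather than the notebook's dataset; `skew_report` is a hypothetical helper name and the columns are invented.

```python
import pandas as pd


def skew_report(frame):
    """Return the sample skewness of each numeric column, largest magnitude first."""
    numeric = frame.select_dtypes(include="number")
    return numeric.skew().sort_values(key=lambda s: s.abs(), ascending=False)


# Toy data for illustration only
toy = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5],
    "right_tailed": [1, 1, 1, 2, 50],
    "label": list("abcde"),
})
skewness = skew_report(toy)
print(skewness)
```

Values near zero suggest a roughly symmetric distribution, while large positive values indicate a long right tail.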

## Analyze Correlations
Next, we will analyze the correlations between numerical features.

In [None]:
plt.figure(figsize=(12, 8))
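# Restrict to numeric columns; df.corr() raises an error on non-numeric columns in pandas 2.0+
correlation_matrix = df.select_dtypes(include='number').corr()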
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
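
A heatmap gets hard to read with many features, so it can help to list the strongest pairs directly. The sketch below keeps the upper triangle of the correlation matrix and ranks pairs by absolute correlation. `top_correlated_pairs` is a hypothetical helper, and the toy data is invented for illustration.

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(frame, n=5):
    """Return the n feature pairs with the largest absolute Pearson correlation."""
    corr = frame.select_dtypes(include="number").corr()
    # Keep each pair once: upper triangle only, diagonal excluded
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)


# Toy data for illustration only
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],   # exact multiple of x
    "z": [5, 3, 4, 1, 2],    # tends to fall as x rises
    "label": list("abcde"),  # non-numeric, ignored
})
top = top_correlated_pairs(toy, n=3)
print(top)
```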

## Identify Missing Values
Let's check for any missing values in the dataset.

In [None]:
missing_values = df.isnull().sum()
missing_values[missing_values > 0]
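
Raw counts are easier to interpret next to the share of rows they represent. A minimal sketch, assuming a toy DataFrame in place of the real data; `missing_value_report` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def missing_value_report(frame):
    """Return count and percentage of missing values per column, worst first."""
    counts = frame.isna().sum()
    report = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns that actually have missing values
    report = report[report["missing_count"] > 0]
    return report.sort_values("missing_count", ascending=False)


# Toy data for illustration only
toy = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "city": ["Oslo", None, None, "Lima"],
    "score": [1.0, 2.0, 3.0, 4.0],
})
report = missing_value_report(toy)
print(report)
```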

## Summary Statistics
Finally, we will generate summary statistics for the dataset.

In [None]:
summary_statistics = df.describe()
summary_statistics
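
By default, `describe()` summarizes only numeric columns when the frame contains any. Passing `include='all'` adds count, unique, top, and freq rows for the remaining columns. A minimal sketch on invented toy data:

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "age": [20, 30, 40],
    "city": ["Oslo", "Lima", "Oslo"],
})

# include='all' reports numeric and non-numeric columns in one table;
# statistics that do not apply to a column are shown as NaN
summary = toy.describe(include="all")
print(summary)
```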