# Exploratory Data Analysis

This notebook performs exploratory data analysis (EDA) on the dataset for the Multidisciplinary Deepfake Detection product. It includes steps for visualizing data distributions, examining correlations, and identifying potential outliers.

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import logging

# Set up logging (create the log directory first so the log file can be opened)
os.makedirs('../logs', exist_ok=True)
logging.basicConfig(filename='../logs/exploratory_data_analysis.log', level=logging.INFO,
                    format='%(asctime)s %(levelname)s: %(message)s', datefmt='%Y-%m-%d %H:%M:%S')

# Load configuration
from src.config import Config

# Define paths
processed_data_path = os.path.join(Config.PROCESSED_DATA_DIR, 'processed_data.csv')

logging.info("EDA started.")

# Load processed data
logging.info("Loading processed data from {}.".format(processed_data_path))
data = pd.read_csv(processed_data_path)
logging.info("Processed data loaded successfully with shape {}.".format(data.shape))

# Display column types and non-null counts
data.info()

# Display summary statistics for the numerical columns
data.describe()
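
The quartiles reported by `data.describe()` can be turned into a rough per-column outlier count. The sketch below applies the common 1.5 × IQR rule; the helper name `count_iqr_outliers`, the multiplier `k`, and the toy frame are illustrative assumptions, not part of the project code.

```python
import numpy as np
import pandas as pd


def count_iqr_outliers(df: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = df.select_dtypes(include=[np.number])
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Comparisons align on column labels; NaN values compare False and are not counted
    is_outlier = (numeric < lower) | (numeric > upper)
    return is_outlier.sum().sort_values(ascending=False)


# Hypothetical toy frame standing in for `data`
example = pd.DataFrame({
    "feature_a": [1.0, 2.0, 3.0, 4.0, 100.0],
    "feature_b": [10.0, 11.0, 12.0, 13.0, 14.0],
    "source": ["a", "b", "c", "d", "e"],
})
outlier_counts = count_iqr_outliers(example)
print(outlier_counts)
```

In the notebook this would be called as `count_iqr_outliers(data)`. The rule is only a heuristic: heavily skewed features may flag many legitimate values.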

## Data Distribution

This section visualizes the distribution of each numerical feature and of the target label.

In [None]:
# Plot a histogram for each numerical feature
logging.info("Visualizing the distribution of numerical features.")
numerical_features = data.select_dtypes(include=[np.number]).columns
data[numerical_features].hist(figsize=(15, 10), bins=30)
plt.suptitle('Distribution of Numerical Features')
plt.savefig('../logs/distribution_numerical_features.png')
plt.show()
logging.info("Distribution of numerical features visualized.")

# Plot the number of samples per target label
logging.info("Visualizing the distribution of the target label.")
sns.countplot(x='label', data=data)
plt.title('Distribution of Target Label')
plt.savefig('../logs/distribution_target_label.png')
plt.show()
logging.info("Distribution of target label visualized.")
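
The count plot shows class balance visually; the same information can be tabulated so it can be logged or compared across runs. This is a minimal sketch assuming the target column is named `label`, as in the plot above. The helper `label_balance` and the toy data are hypothetical.

```python
import pandas as pd


def label_balance(df: pd.DataFrame, label_col: str = "label") -> pd.DataFrame:
    """Return the count and proportion of each class in `label_col`."""
    counts = df[label_col].value_counts()
    return pd.DataFrame({"count": counts, "proportion": counts / counts.sum()})


# Hypothetical toy labels standing in for `data['label']`
example = pd.DataFrame({"label": [0, 0, 0, 1]})
balance = label_balance(example)
print(balance)
```

A strongly skewed proportion column suggests that plain accuracy may be a misleading metric during model training.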

## Correlation Analysis

This section examines the pairwise correlations among the numerical features, including the target label when it is numerically encoded.

In [None]:
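# Compute the correlation matrix over numeric columns only
# (in pandas 2.0+, DataFrame.corr() raises an error if non-numeric columns are present)
logging.info("Computing correlation matrix.")
correlation_matrix = data.select_dtypes(include=[np.number]).corr()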

# Plot the correlation matrix as an annotated heatmap
logging.info("Plotting correlation matrix.")
plt.figure(figsize=(15, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix')
plt.savefig('../logs/correlation_matrix.png')
plt.show()
logging.info("Correlation matrix plotted.")
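
With many features the heatmap becomes hard to read, so it can help to rank features by their correlation with the target alone. The sketch below assumes the target column is named `label` and is numerically encoded (for example 0/1). The helper `correlation_with_label` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def correlation_with_label(df: pd.DataFrame, label_col: str = "label") -> pd.Series:
    """Rank numeric features by absolute Pearson correlation with the label."""
    numeric = df.select_dtypes(include=[np.number])
    if label_col not in numeric.columns:
        raise ValueError(f"'{label_col}' must be a numeric column to be correlated")
    corr = numeric.corr()[label_col].drop(label_col)
    # Order by magnitude while keeping the sign of each coefficient
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Hypothetical toy frame standing in for `data`
example = pd.DataFrame({
    "feature_a": [1.0, 2.0, 3.0, 4.0],
    "feature_b": [1.0, -1.0, 1.0, -1.0],
    "label": [0, 0, 1, 1],
})
ranked = correlation_with_label(example)
print(ranked)
```

Pearson correlation only captures linear association, so a low value here does not rule out a useful nonlinear relationship.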

## Pairplot Analysis

This section generates pair plots of the numerical features to reveal pairwise relationships and potential outliers.

In [None]:
# Pairplot analysis
logging.info("Generating pairplot for numerical features.")
sns.pairplot(data[numerical_features], diag_kind='kde')
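# Place the title slightly above the grid so it does not overlap the top row of axes
plt.suptitle('Pairplot Analysis', y=1.02)
plt.savefig('../logs/pairplot_analysis.png', bbox_inches='tight')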
plt.show()
logging.info("Pairplot analysis completed.")
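
A pair plot draws one panel per pair of columns, so its cost grows quadratically with the number of features. One way to keep it manageable is to plot only a subset of columns and rows. The sketch below picks the highest-variance numeric columns and a random sample of rows; the helper `pairplot_subset`, its defaults, and the variance criterion are illustrative assumptions rather than the project's method.

```python
import numpy as np
import pandas as pd


def pairplot_subset(df: pd.DataFrame, max_features: int = 5,
                    max_rows: int = 1000, seed: int = 0) -> pd.DataFrame:
    """Return a smaller frame: top-variance numeric columns, sampled rows."""
    numeric = df.select_dtypes(include=[np.number])
    # Variance is scale-dependent; standardize first if features use different units
    top_columns = numeric.var().nlargest(max_features).index
    subset = numeric[top_columns]
    if len(subset) > max_rows:
        subset = subset.sample(n=max_rows, random_state=seed)
    return subset


# Hypothetical toy frame standing in for `data`
base = np.arange(50, dtype=float)
example = pd.DataFrame({
    "wide": base * 10.0,
    "medium": base * 1.0,
    "narrow": base * 0.01,
})
subset = pairplot_subset(example, max_features=2, max_rows=20)
print(subset.shape)
```

The returned frame can be passed to `sns.pairplot(subset, diag_kind='kde')` in place of `data[numerical_features]`.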

## Summary

The exploratory data analysis provided insight into the feature distributions and the correlations between features, and it surfaced potential relationships and outliers. These findings guide the subsequent data preprocessing and model training steps.

In [None]:
logging.info("EDA completed.")