# Comprehensive Exploratory Data Analysis (EDA) Report

## Introduction

This report provides a detailed exploratory data analysis (EDA) of the given dataset. The primary objectives of this EDA are to:

- Understand the underlying structure of the data.
- Identify patterns, anomalies, and relationships within the dataset.
- Prepare the data for further modeling by cleaning and transforming it.

We will cover the main stages of the EDA process: data loading, data cleaning, univariate, bivariate, and multivariate analysis, and Principal Component Analysis (PCA).

## Core Feature
Throughout the project, our primary goal has been to keep every process modular and function-based, so that each step can be applied to the entire dataset at once. Packaging each step as a reusable function is how we aim to standardise the workflow.

## Data Loading

In this section, we will load the dataset and perform an initial examination to understand its structure and basic characteristics.

In [None]:
import pandas as pd
df1 = pd.read_csv('File Path')
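
The initial examination mentioned above could look like the following minimal sketch. The helper name `summarize_structure` and the inline sample data are assumptions for illustration only; in practice the DataFrame loaded in the cell above would be passed in instead.

```python
import io

import pandas as pd


def summarize_structure(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing-value count, distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isnull().sum(),
        "n_unique": df.nunique(dropna=True),
    })


# Hypothetical sample standing in for the real CSV file
sample_csv = io.StringIO(
    "age,city,income\n"
    "34,Pune,52000\n"
    ",Delhi,61000\n"
    "29,Pune,\n"
    "41,,58000\n"
)
sample_df = pd.read_csv(sample_csv)

print(sample_df.shape)   # (rows, columns)
print(sample_df.head())  # first few records
print(summarize_structure(sample_df))
```

A per-column summary like this makes it easier to decide, before any cleaning, which columns need imputation and which are numeric, datetime, or categorical.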

## Data Cleaning

Data cleaning is a crucial step to ensure the quality of the dataset. In this section, we will handle missing values and detect and treat outliers.

### Null Handling

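We will address missing values with an imputation strategy chosen by column type, with the aim of preserving as much data quality as possible: KNN imputation for numeric columns, forward fill followed by backward fill for datetime columns, and the mode for categorical or text columns.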

In [None]:
import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer

def smart_impute(file_path):
    # Read the CSV file (date columns stay as text unless parse_dates is passed)
    df = pd.read_csv(file_path)

    # Numeric columns: impute them together so KNN can measure distance using
    # the other numeric features; entirely empty columns are skipped
    numeric_cols = [c for c in df.select_dtypes(include=[np.number]).columns
                    if df[c].notna().any()]
    if numeric_cols and df[numeric_cols].isnull().any().any():
        imputer = KNNImputer(n_neighbors=5)
        df[numeric_cols] = imputer.fit_transform(df[numeric_cols])

    for column in df.columns:
        if column in numeric_cols or not df[column].isnull().any():
            continue  # Already imputed above, or nothing missing
        if pd.api.types.is_datetime64_any_dtype(df[column]):
            # For datetime columns, forward fill and then backward fill
            df[column] = df[column].ffill().bfill()
        else:
            # For categorical/text columns, use the mode (if one exists)
            modes = df[column].mode()
            if not modes.empty:
                df[column] = df[column].fillna(modes[0])

    # Save the processed DataFrame back to the original file
    df.to_csv(file_path, index=False)
    print(f"Updated original file: {file_path}")

    return df
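
One caveat is worth illustrating: `pd.read_csv` leaves date columns as plain text unless told otherwise, so a datetime check only succeeds once the column has been parsed. The sketch below uses hypothetical column names (`recorded_at`, `reading`) and inline sample data to show the parse step followed by forward fill and backward fill.

```python
import io

import pandas as pd

# Hypothetical sample; "recorded_at" and "reading" are illustrative names
raw = io.StringIO(
    "recorded_at,reading\n"
    ",1.0\n"
    "2024-01-02,2.0\n"
    ",3.0\n"
    "2024-01-04,4.0\n"
)
df = pd.read_csv(raw)

# Straight after loading, the column is not a datetime type
print(pd.api.types.is_datetime64_any_dtype(df["recorded_at"]))

# Parse explicitly; missing entries become NaT
df["recorded_at"] = pd.to_datetime(df["recorded_at"])
print(pd.api.types.is_datetime64_any_dtype(df["recorded_at"]))

# Forward fill copies the last known date downwards; backward fill then
# covers a leading gap that had no earlier value to copy from
df["recorded_at"] = df["recorded_at"].ffill().bfill()
print(df)
```

Forward fill alone would leave the first row empty here, which is why the two fills are chained.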
    


### Outlier Handling

We will identify outliers using statistical methods such as the Interquartile Range (IQR) and Z-score and decide on appropriate treatment methods to handle them.

In [None]:
import pandas as pd
import numpy as np
from scipy import stats

def handle_outliers(file_path):
    df = pd.read_csv(file_path)
    
    for column in df.select_dtypes(include=[np.number]).columns:
        # Check for normality
        _, p_value = stats.normaltest(df[column].dropna())
        
        if p_value < 0.05:  # Not normally distributed
            # Use IQR method
            Q1 = df[column].quantile(0.25)
            Q3 = df[column].quantile(0.75)
            IQR = Q3 - Q1
            lower_bound = Q1 - 1.5 * IQR
            upper_bound = Q3 + 1.5 * IQR
            df[column] = df[column].clip(lower_bound, upper_bound)
            print(f"Used IQR method for {column}")
        else:  # Normally distributed
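            # Use Z-score method; nan_policy='omit' stops a single missing
            # value from turning every score in the column into NaN
            z_scores = np.abs(stats.zscore(df[column], nan_policy='omit'))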
            df[column] = df[column].mask(z_scores > 3, df[column].median())
            print(f"Used Z-score method for {column}")
        
    
    df.to_csv(file_path, index=False)
    print(f"Handled outliers in {file_path}")
    return df
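
To make the IQR rule concrete, here is a small worked example on a hypothetical sample (the numbers are made up for illustration and are not taken from the dataset). It computes the quartiles, derives the 1.5 × IQR fences, and caps the extreme value at the upper fence.

```python
import pandas as pd

# Hypothetical skewed sample with one extreme value
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100], dtype=float)

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Fences: anything further than 1.5 * IQR from the quartiles counts as an outlier
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Clipping caps outliers at the nearest fence instead of dropping the rows
clipped = values.clip(lower_bound, upper_bound)

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"bounds=({lower_bound}, {upper_bound})")
print(clipped.tolist())
```

Here the quartiles are 3 and 7, so the fences sit at -3 and 13, and the value 100 is capped at 13 while every other value is left unchanged.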

# Data Analysis

In this section, we'll perform an exploratory data analysis (EDA) on our cleaned dataset. EDA is a critical step in understanding our data's characteristics, patterns, and relationships. We'll divide our analysis into two phases:

1. Analysis by columns (pre-PCA)
2. Analysis post-PCA (Principal Component Analysis)

First, let's import the necessary libraries:

In [None]:
import os

import numpy as np
import pandas as pd

# Select the non-interactive Agg backend before pyplot is imported,
# so figures are rendered straight to files
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

These libraries will help us load, manipulate, and visualize our data effectively.

## Phase 1: Analysis by Columns (Pre-PCA)

In this phase, we'll analyze our data based on its original columns. This helps us understand the basic structure and characteristics of our dataset.

### Univariate Analysis
We'll perform univariate analysis, which examines each variable in the dataset individually:

In [None]:

def plot_univariate(df, column, ax):
    sns.histplot(df[column], kde=True, ax=ax)
    ax.set_title(column)
    ax.set_xlabel(column)
    ax.set_ylabel('Frequency')

This function draws a histogram with a kernel density estimate (KDE) overlay for a single column. Calling it once per numerical column shows each distribution and makes skewness, multiple peaks, and outliers easy to spot.

### Bivariate Analysis
Bivariate analysis examines the relationship between pairs of variables:

In [None]:

def plot_bivariate(df, col1, col2, ax):
    sns.scatterplot(data=df, x=col1, y=col2, ax=ax)
    ax.set_title(f'{col1} vs {col2}')

This creates scatter plots for pairs of numerical variables. Scatter plots help us visualize relationships between variables, such as positive or negative correlations, or more complex patterns.

### Multivariate Analysis
A correlation heatmap provides an overview of the relationships between all pairs of variables:

In [None]:

def plot_correlation_heatmap(df, prefix="correlation_heatmap"):
    os.makedirs("plots", exist_ok=True)
    plt.figure(figsize=(12, 10))
    corr_matrix = df.corr(numeric_only=True)  # Ignore non-numeric columns
    sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f")
    plt.title(f'{prefix.replace("_", " ").title()}')
    plt.savefig(f'plots/{prefix}.png')
    plt.close()

The heatmap displays Pearson correlation coefficients between -1 and 1, where -1 indicates a perfect negative linear relationship, 0 indicates no linear relationship, and 1 indicates a perfect positive linear relationship. This helps us quickly identify which variables are most strongly related to each other.
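
When there are many columns, reading the strongest relationships off a heatmap by eye gets difficult. As a complement, the sketch below ranks variable pairs by absolute correlation. The function name `strongest_pairs` and the toy data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def strongest_pairs(df: pd.DataFrame, top_n: int = 3) -> pd.Series:
    """Return the top_n variable pairs ranked by absolute Pearson correlation."""
    corr = df.corr(numeric_only=True)
    # Keep only the upper triangle (k=1 skips the diagonal) so each pair appears once
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    # Sort by magnitude but keep the signed coefficient in the result
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(top_n)


# Hypothetical toy data: b rises with a, c falls with a, d is unrelated noise
rng = np.random.default_rng(0)
a = np.arange(50, dtype=float)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + 1,
    "c": -a + rng.normal(scale=5.0, size=50),
    "d": rng.normal(size=50),
})
print(strongest_pairs(toy))
```

The result is a Series indexed by column pairs, so the sign of each coefficient is preserved even though the ranking uses its magnitude.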

## Phase 2: Analysis Post-PCA
In this phase, we'll perform Principal Component Analysis (PCA) and analyze the results. PCA is a technique used to reduce the dimensionality of a dataset while retaining as much of the original variability as possible.

### PCA
**PCA** works by finding new variables (principal components) that are linear combinations of the original variables. These new variables are uncorrelated and ordered so that the first few retain most of the variation present in all of the original variables.
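
To expose the mechanics behind that description, the following sketch computes principal components directly with NumPy's singular value decomposition instead of scikit-learn's `PCA` class. The data is synthetic and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: three correlated features built from two latent factors
latent = rng.normal(size=(200, 2))
X = np.column_stack([
    latent[:, 0],
    latent[:, 0] + 0.1 * latent[:, 1],
    latent[:, 1],
])

# Standardise each feature to zero mean and unit variance
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Rows of Vt are the principal directions; singular values are sorted descending
U, s, Vt = np.linalg.svd(Z, full_matrices=False)

# Each component is a linear combination of the standardised features
scores = Z @ Vt.T

# Share of the total variance carried by each component
explained_ratio = s**2 / np.sum(s**2)

print("explained variance ratio:", np.round(explained_ratio, 3))
print("covariance of components:\n", np.round(np.cov(scores, rowvar=False), 6))
```

The printed covariance matrix is diagonal up to rounding error, which is the "uncorrelated" property, and the ratios come out in descending order, which is the "ordered" property.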

In [None]:
def reduce_dimensions_with_pca(df, n_components=2):
    scaler = StandardScaler()
    numeric_df = df.select_dtypes(include=[np.number])
    scaled_data = scaler.fit_transform(numeric_df)
    pca = PCA(n_components=n_components)
    principal_components = pca.fit_transform(scaled_data)
    pca_df = pd.DataFrame(data=principal_components, columns=[f'PC{i+1}' for i in range(n_components)])
    explained_variance = pca.explained_variance_ratio_
    
    loadings = pd.DataFrame(pca.components_.T, columns=[f'PC{i+1}' for i in range(n_components)], index=numeric_df.columns)
    
    top_features_dict = {f'PC{i+1}': ', '.join(loadings[f'PC{i+1}'].abs().sort_values(ascending=False).head(3).index.tolist()) for i in range(n_components)}
    
    print(f"Explained variance ratio by PCA: {explained_variance}")
    return pd.concat([df.reset_index(drop=True), pca_df], axis=1), loadings, top_features_dict, explained_variance

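This function standardizes the numeric columns, performs PCA, and returns four values: the original data with the principal component columns appended, the loadings, the three highest-loading features for each component, and the explained variance ratios.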

### Analyzing PCA Results
**Univariate and Bivariate Analysis of PCA Components**

We'll perform the same univariate and bivariate analyses on our PCA components as we did on the original variables:

In [None]:
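# Perform univariate and bivariate analysis on PCA components.
# pca_df is the first value returned by reduce_dimensions_with_pca;
# combined_analysis is a plotting helper from the repository and is not shown in this report.
pca_columns = [col for col in pca_df.columns if col.startswith('PC')]
combined_analysis(pca_df, pca_columns, plot_univariate, prefix="univariate_pca")
combined_analysis(pca_df, pca_columns, plot_bivariate, prefix="bivariate_pca")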

This helps us understand the distribution and relationships of our new principal components.
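
The helper that drives these plots is not reproduced in this report. As a rough idea of what such a driver might do, here is a hypothetical stand-in named `combined_analysis_sketch`; the repository's actual implementation may differ. It assumes a three-argument plotting function is univariate `(df, column, ax)` and a four-argument one is bivariate `(df, col1, col2, ax)`, and the demo plotters use plain Matplotlib to keep the example self-contained.

```python
import inspect
import os
import tempfile
from itertools import combinations

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, figures are written to disk
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def combined_analysis_sketch(df, columns, plot_func, prefix, out_dir="plots"):
    """Draw plot_func once per column (or per column pair) on a grid and save it."""
    # Assumption: 3 parameters means univariate, otherwise bivariate
    n_args = len(inspect.signature(plot_func).parameters)
    items = [(c,) for c in columns] if n_args == 3 else list(combinations(columns, 2))
    if not items:
        return None

    n_cols = min(3, len(items))
    n_rows = -(-len(items) // n_cols)  # ceiling division
    fig, axes = plt.subplots(n_rows, n_cols,
                             figsize=(5 * n_cols, 4 * n_rows), squeeze=False)
    flat_axes = axes.ravel()

    for ax, item in zip(flat_axes, items):
        plot_func(df, *item, ax)
    for ax in flat_axes[len(items):]:
        ax.set_visible(False)  # hide unused grid cells

    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, f"{prefix}.png")
    fig.tight_layout()
    fig.savefig(path)
    plt.close(fig)
    return path


# Minimal demo plotters with the same argument order as the report's functions
def hist_plot(df, column, ax):
    ax.hist(df[column].dropna(), bins=20)
    ax.set_title(column)


def scatter_plot(df, col1, col2, ax):
    ax.scatter(df[col1], df[col2], s=10)
    ax.set_title(f"{col1} vs {col2}")


rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(100, 3)), columns=["PC1", "PC2", "PC3"])
demo_dir = tempfile.mkdtemp()  # throwaway folder for the demo output
print(combined_analysis_sketch(demo, list(demo.columns), hist_plot, "univariate_demo", demo_dir))
print(combined_analysis_sketch(demo, list(demo.columns), scatter_plot, "bivariate_demo", demo_dir))
```

Passing the plotting function as an argument keeps the grid layout and file handling in one place, which fits the modular approach described at the start of this report.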

**Multivariate Analysis of PCA Components** 

In [None]:
# Create correlation heatmap for PCA components
plot_correlation_heatmap(pca_df[pca_columns], prefix="pca_correlation_heatmap")

This heatmap should show off-diagonal correlations at or very near zero, since principal components are uncorrelated by construction.

### Explained Variance Plot
The explained variance plot shows how much of the total variance in the dataset is explained by each principal component:

In [None]:
def plot_explained_variance(explained_variance):
    plt.figure(figsize=(10, 6))
    plt.bar(range(1, len(explained_variance) + 1), explained_variance, align='center', alpha=0.8)
    plt.title('Explained Variance Ratio by Principal Components')
    plt.xlabel('Principal Component')
    plt.ylabel('Explained Variance Ratio')
    plt.xticks(range(1, len(explained_variance) + 1))
    plt.grid(True)
    plt.savefig('explained_variance.png')  # Save the figure to a file
    plt.close()  # Close the figure to release resources

This plot helps us determine how many principal components are needed to explain a significant portion of the variance in our data.
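
One common way to turn that reading into a number is to take the cumulative sum of the ratios and find where it first reaches a chosen threshold. The sketch below does this. The function name, the 90% threshold, and the example ratios are all assumptions for illustration, not results from the dataset.

```python
import numpy as np


def components_for_threshold(explained_variance, threshold=0.90):
    """Smallest number of leading components whose cumulative ratio reaches threshold.

    If the threshold is never reached, the result is len(explained_variance) + 1.
    """
    cumulative = np.cumsum(explained_variance)
    # searchsorted returns the first index where cumulative >= threshold
    return int(np.searchsorted(cumulative, threshold) + 1)


# Hypothetical ratios, not results from the dataset
ratios = np.array([0.50, 0.30, 0.15, 0.05])
print(np.cumsum(ratios))
print(components_for_threshold(ratios, threshold=0.90))
```

For this to be meaningful the ratios should cover all components, so PCA would need to be fitted with every component retained rather than only the first two.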

### Biplot of First Two Principal Components
A biplot allows us to visualize both the principal components and the original features in the same plot:

In [None]:
def biplot(pca_df, loadings):
    plt.figure(figsize=(10, 8))
    sns.scatterplot(data=pca_df, x='PC1', y='PC2')
    plt.xlabel('PC1')
    plt.ylabel('PC2')
    plt.title('Biplot of PC1 and PC2')
    
    for i in loadings.index:
        plt.arrow(0, 0, loadings.loc[i, 'PC1'], loadings.loc[i, 'PC2'], color='r', alpha=0.5)
        plt.text(loadings.loc[i, 'PC1']*1.1, loadings.loc[i, 'PC2']*1.1, i, color='g', ha='center', va='center')

    plt.grid(True)
    os.makedirs('plots', exist_ok=True)  # Make sure the output folder exists
    plt.savefig('plots/biplot.png')
    plt.close()

This plot shows how the original features contribute to the first two principal components and how the data points are distributed in this new space.


### Interpreting PCA Results
Finally, we'll look at the PCA loadings and top contributing features:

In [None]:
print("\nPCA Loadings (Feature Contributions):")
print(loadings)

print("\nTop Contributing Features per Principal Component:")
for pc, features in top_features_dict.items():
    print(f"{pc}: {features}")

The loadings show how much each original feature contributes to each principal component. The top contributing features give us a quick summary of which original features are most important for each principal component.

## Conclusion
This two-phase analysis provides a comprehensive view of our data. In Phase 1, we examined the original features, their distributions, and relationships, giving us a ground-level understanding of our dataset. In Phase 2, we used PCA to reduce dimensionality and potentially uncover hidden patterns in the data.

The pre-PCA analysis helps us understand individual features and their direct relationships, while the post-PCA analysis provides insights into the overall structure of the data and the most important combinations of features that explain the variance in our dataset.
By comparing the results from both phases, we can gain a deeper understanding of our data's structure and the underlying patterns that might not be immediately apparent from the original variables alone.

**NOTE**
- All the code for the processes can be found in their respective folders in the repository.
- The detailed analysis reports of the dataset can be found in the Reports folder, which also contains this file.