# Exploratory Data Analysis

In [1]:
import logging
import pandas as pd
import os
import sys

# Add the 'scripts' directory to the Python path for module imports
sys.path.append(os.path.abspath(os.path.join('..', 'scripts')))

# Import the load_data function from the data_loader module
try:
    from data_loader import load_data
    logger_initialized = True
except ImportError as e:
    logger_initialized = False
    print(f"Error importing 'load_data': {e}")

# Set pandas display options for better visibility
pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 200)

In [2]:
# Configure logging
def setup_logger(name: str = 'my_logger') -> logging.Logger:
    """
    Set up a logger with INFO level and StreamHandler.
    
    Parameters:
    -----------
    name : str
        The name of the logger.
    
    Returns:
    --------
    logging.Logger
        Configured logger instance.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    
    # Prevent duplicate handlers
    if not logger.hasHandlers():
        handler = logging.StreamHandler()
        handler.setLevel(logging.INFO)
        formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    
    return logger

# Initialize logger
logger = setup_logger()
logger.info("Imported necessary libraries.")

# Check and log if 'load_data' was successfully imported
if logger_initialized:
    logger.info("'load_data' module imported successfully.")
else:
    logger.warning("'load_data' module could not be imported. Check the 'scripts' directory and file availability.")

2025-01-23 15:14:20,449 - INFO - Imported necessary libraries.
2025-01-23 15:14:20,450 - INFO - 'load_data' module imported successfully.


#### Data Loading

In [3]:
logger.info("🟢 Starting the data loading process...")
df = load_data('../data/data.csv')
if not df.empty:
    logger.info(f"✅ Data loaded successfully! The dataset contains {df.shape[0]} rows and {df.shape[1]} columns.")
else:
    logger.warning("⚠️ Data loading completed, but the dataset is empty.")

2025-01-23 15:14:20,470 - INFO - 🟢 Starting the data loading process...
2025-01-23 15:14:20,727 - INFO - ✅ Data loaded successfully! The dataset contains 95662 rows and 15 columns.


Data successfully loaded from '../data/data.csv' with 95662 rows and 15 columns.
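The source of `load_data` lives in the `scripts` directory and is not shown in this notebook. As a rough illustration only, a loader with the behavior seen above (`TransactionId` as the index, a confirmation message, an empty frame on failure) might look like the sketch below. The name `load_data_sketch`, its `index_col` parameter, and the missing-file handling are assumptions, not the project's actual implementation.

```python
import logging
from pathlib import Path

import pandas as pd


def load_data_sketch(path: str, index_col: str = "TransactionId") -> pd.DataFrame:
    """Hypothetical stand-in for data_loader.load_data (assumed behavior)."""
    csv_path = Path(path)
    if not csv_path.is_file():
        # Assumption: a missing file yields an empty frame so that callers
        # can test `df.empty` instead of catching an exception.
        logging.getLogger(__name__).error("File not found: %s", csv_path)
        return pd.DataFrame()

    # Assumption: the identifier column is used as the index.
    frame = pd.read_csv(csv_path, index_col=index_col)
    print(
        f"Data successfully loaded from '{path}' "
        f"with {frame.shape[0]} rows and {frame.shape[1]} columns."
    )
    return frame
```

Returning an empty DataFrame rather than raising keeps the `if not df.empty` check in the cell above meaningful.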


In [4]:
# Import the CreditRiskAnalysis class from the eda_analysis module
from eda_analysis import CreditRiskAnalysis

# Initialize the class
cr_eda = CreditRiskAnalysis(df)

# Logging activity
logger.info("🟢 Data overview initiated.")

2025-01-23 15:14:21,056 - INFO - 🟢 Data overview initiated.


### **Overview of the dataset**

In [5]:
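# Overview of the dataset (log completion only if the overview actually ran)
if not df.empty:
    cr_eda.data_overview()
    logger.info("✅ Data overview successfully completed.")
else:
    logger.warning("⚠️ Data overview skipped because the dataset is empty.")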

            Data Overview           
Number of Rows: 95662
Number of Columns: 15

Column Data Types:
BatchId                  object
AccountId                object
SubscriptionId           object
CustomerId               object
CurrencyCode             object
CountryCode               int64
ProviderId               object
ProductId                object
ProductCategory          object
ChannelId                object
Amount                  float64
Value                     int64
TransactionStartTime     object
PricingStrategy           int64
FraudResult               int64
dtype: object

First Five Rows:


TransactionId,BatchId,AccountId,SubscriptionId,CustomerId,CurrencyCode,CountryCode,ProviderId,ProductId,ProductCategory,ChannelId,Amount,Value,TransactionStartTime,PricingStrategy,FraudResult
TransactionId_76871,BatchId_36123,AccountId_3957,SubscriptionId_887,CustomerId_4406,UGX,256,ProviderId_6,ProductId_10,airtime,ChannelId_3,1000.0,1000,2018-11-15T02:18:49Z,2,0
TransactionId_73770,BatchId_15642,AccountId_4841,SubscriptionId_3829,CustomerId_4406,UGX,256,ProviderId_4,ProductId_6,financial_services,ChannelId_2,-20.0,20,2018-11-15T02:19:08Z,2,0
TransactionId_26203,BatchId_53941,AccountId_4229,SubscriptionId_222,CustomerId_4683,UGX,256,ProviderId_6,ProductId_1,airtime,ChannelId_3,500.0,500,2018-11-15T02:44:21Z,2,0
TransactionId_380,BatchId_102363,AccountId_648,SubscriptionId_2185,CustomerId_988,UGX,256,ProviderId_1,ProductId_21,utility_bill,ChannelId_3,20000.0,21800,2018-11-15T03:32:55Z,2,0
TransactionId_28195,BatchId_38780,AccountId_4841,SubscriptionId_3829,CustomerId_988,UGX,256,ProviderId_4,ProductId_6,financial_services,ChannelId_2,-644.0,644,2018-11-15T03:34:21Z,2,0


2025-01-23 15:14:21,099 - INFO - ✅ Data overview successfully completed.



Missing Values Overview:
Series([], dtype: int64)


#### Summary

The dataset contains a total of **95,662 transactions** with **15 attributes**, including:

- **Categorical Identifiers**: 
  - BatchId
  - AccountId
  - CustomerId

- **Financial Metrics**: 
  - Amount
  - Value

- **Timestamps**: 
  - TransactionStartTime

There are **no missing values** in the dataset: the missing-values overview returns an empty Series, so every column is fully populated.
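The empty `Series([], dtype: int64)` in the output is what a per-column null count looks like once columns without nulls are filtered out. A minimal sketch of such a check follows; the helper name `missing_values_overview` is illustrative and is not taken from `CreditRiskAnalysis`.

```python
import pandas as pd


def missing_values_overview(frame: pd.DataFrame) -> pd.Series:
    """Return the null count per column, keeping only columns that have nulls."""
    counts = frame.isnull().sum()
    return counts[counts > 0].sort_values(ascending=False)


# Toy frame with one missing Amount; a complete frame yields an empty Series.
sample = pd.DataFrame({"Amount": [1000.0, -20.0, None], "Value": [1000, 20, 500]})
print(missing_values_overview(sample))
```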


In [6]:
# Convert the 'TransactionStartTime' column to a datetime format for better handling of date and time data
df['TransactionStartTime'] = pd.to_datetime(df['TransactionStartTime'])

# Print the first five rows to confirm the conversion and check the updated DataFrame
print("Updated 'TransactionStartTime' column:")
print(df[['TransactionStartTime']].head())

Updated 'TransactionStartTime' column:
                         TransactionStartTime
TransactionId                                
TransactionId_76871 2018-11-15 02:18:49+00:00
TransactionId_73770 2018-11-15 02:19:08+00:00
TransactionId_26203 2018-11-15 02:44:21+00:00
TransactionId_380   2018-11-15 03:32:55+00:00
TransactionId_28195 2018-11-15 03:34:21+00:00
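Once the column is a timezone-aware datetime, its components are available through the `.dt` accessor. The sketch below uses two timestamps copied from the rows above; the derived column names (`TransactionHour` and so on) are hypothetical and do not appear elsewhere in this notebook.

```python
import pandas as pd

# Two timestamps taken from the first rows of the dataset.
timestamps = pd.DataFrame(
    {"TransactionStartTime": ["2018-11-15T02:18:49Z", "2018-11-15T03:32:55Z"]}
)

# utc=True makes the result timezone-aware in UTC, matching the "+00:00" above.
timestamps["TransactionStartTime"] = pd.to_datetime(
    timestamps["TransactionStartTime"], utc=True
)

# Hypothetical derived columns built with the .dt accessor.
timestamps["TransactionHour"] = timestamps["TransactionStartTime"].dt.hour
timestamps["TransactionDay"] = timestamps["TransactionStartTime"].dt.day
timestamps["TransactionMonth"] = timestamps["TransactionStartTime"].dt.month
timestamps["TransactionDayOfWeek"] = timestamps["TransactionStartTime"].dt.dayofweek

print(timestamps)
```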


### **Statistics Summary**

In [13]:
# Log the start of the summary statistics process
logger.info("🟢 Generating summary statistics for numeric columns...")   
print("================================================================================")

# Generate statistical summary for numeric columns
summary_stats = cr_eda.summary_statistics()

# Display the transposed summary statistics for better readability
print("================================================================================")
print("Summary statistics generated:")
display(summary_stats.T)


# Log the completion of the summary statistics generation
print("================================================================================")
logger.info("✅ Summary statistics generation completed.")

2025-01-23 15:17:22,816 - INFO - 🟢 Generating summary statistics for numeric columns...


Summary Statistics:
                    count         mean            std        min    25%  \
Statistic                                                                
CountryCode      95662.0   256.000000       0.000000      256.0  256.0   
Amount           95662.0  6717.846433  123306.797164 -1000000.0  -50.0   
Value            95662.0  9900.583941  123122.087776        2.0  275.0   
PricingStrategy  95662.0     2.255974       0.732924        0.0    2.0   
FraudResult      95662.0     0.002018       0.044872        0.0    0.0   

                    50%     75%        max  median    mode   skewness  \
Statistic                                                               
CountryCode       256.0   256.0      256.0   256.0   256.0   0.000000   
Amount           1000.0  2800.0  9880000.0  1000.0  1000.0  51.098490   
Value            1000.0  5000.0  9880000.0  1000.0  1000.0  51.291086   
PricingStrategy     2.0     2.0        4.0     2.0     2.0   1.659057   
FraudResult         0.

Statistic,CountryCode,Amount,Value,PricingStrategy,FraudResult
count,95662.0,95662.0,95662.0,95662.0,95662.0
mean,256.0,6717.846,9900.584,2.255974,0.002018
std,0.0,123306.8,123122.1,0.732924,0.044872
min,256.0,-1000000.0,2.0,0.0,0.0
25%,256.0,-50.0,275.0,2.0,0.0
50%,256.0,1000.0,1000.0,2.0,0.0
75%,256.0,2800.0,5000.0,2.0,0.0
max,256.0,9880000.0,9880000.0,4.0,1.0
median,256.0,1000.0,1000.0,2.0,0.0
mode,256.0,1000.0,1000.0,2.0,0.0


2025-01-23 15:17:22,919 - INFO - ✅ Summary statistics generation completed.




#### **Summary Statistics**

**General Information**
- **Total Entries**: 95,662

**Attributes**

**Country Code**
- **Mean**: 256.0
- **Standard Deviation**: 0.0
- **Min/Max**: 256.0 / 256.0
- **Skewness**: 0.0
- **Kurtosis**: 0.0
- **IQR**: 0.0

**Amount**
- **Mean**: 6,717.85
- **Standard Deviation**: 123,306.80
- **Min/Max**: -1,000,000.0 / 9,880,000.0
- **Skewness**: 51.10 (highly right-skewed)
- **Kurtosis**: 3,363.13 (heavy-tailed)
- **IQR**: 2,850.0

**Value**
- **Mean**: 9,900.58
- **Standard Deviation**: 123,122.09
- **Min/Max**: 2.0 / 9,880,000.0
- **Skewness**: 51.29 (highly right-skewed)
- **Kurtosis**: 3,378.07 (heavy-tailed)
- **IQR**: 4,725.0

**Pricing Strategy**
- **Mean**: 2.26
- **Standard Deviation**: 0.73
- **Min/Max**: 0.0 / 4.0
- **Skewness**: 1.66 (right-skewed)
- **Kurtosis**: 2.09 (moderate; far lower than for Amount and Value)
- **IQR**: 0.0

**Fraud Result**
- **Mean**: 0.002
- **Standard Deviation**: 0.045
- **Min/Max**: 0.0 / 1.0
- **Skewness**: 22.20 (reflects the rarity of the positive class in a binary flag)
- **Kurtosis**: 490.69 (likewise driven by class imbalance)
- **IQR**: 0.0
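The implementation of `cr_eda.summary_statistics()` is not shown here. The sketch below illustrates one way the reported columns (median, mode, skewness, kurtosis, IQR) could be derived with plain pandas; the function name is hypothetical. Note that pandas reports bias-corrected sample skewness and *excess* kurtosis (a normal distribution scores 0), so values computed another way may differ.

```python
import pandas as pd


def summary_statistics_sketch(frame: pd.DataFrame) -> pd.DataFrame:
    """Illustrative per-column statistics for the numeric columns of a frame."""
    numeric = frame.select_dtypes(include="number")
    stats = numeric.describe().T                 # count, mean, std, min, quartiles, max
    stats["median"] = numeric.median()
    stats["mode"] = numeric.mode().iloc[0]       # first mode if several exist
    stats["skewness"] = numeric.skew()           # bias-corrected sample skewness
    stats["kurtosis"] = numeric.kurtosis()       # excess kurtosis (normal = 0)
    stats["IQR"] = stats["75%"] - stats["25%"]   # interquartile range
    return stats


# Amount and Value from the first five rows shown earlier in this notebook.
sample = pd.DataFrame(
    {
        "Amount": [1000.0, -20.0, 500.0, 20000.0, -644.0],
        "Value": [1000, 20, 500, 21800, 644],
    }
)
print(summary_statistics_sketch(sample)[["median", "skewness", "kurtosis", "IQR"]])
```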

#### **Key Observations**
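- The **Country Code** is constant (256 for every entry), so it adds no information.
- **Amount** and **Value** are strongly right-skewed (skewness of about 51) with extreme outliers: both maxima (9,880,000) lie far above the 75th percentiles (2,800 and 5,000). **Amount** also takes negative values (minimum -1,000,000).
- **Pricing Strategy** is a discrete code ranging from 0 to 4 and concentrated at 2 (the 25th, 50th, and 75th percentiles all equal 2), so it is better treated as a categorical variable than as a normally distributed one.
- **Fraud Result** is highly imbalanced: its mean of 0.002 implies that about 0.2% of transactions are flagged as fraudulent.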
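One common way to quantify the outliers noted above is the 1.5 × IQR rule. This rule is not necessarily what this project uses, and the helper name `iqr_outlier_mask` is illustrative. With the quartiles reported for **Amount** (Q1 = -50, Q3 = 2,800, IQR = 2,850), the fences would fall at -4,325 and 7,075.

```python
import pandas as pd


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values lying more than k * IQR below Q1 or above Q3."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Value column from the first five rows shown earlier in this notebook.
value_sample = pd.Series([1000, 20, 500, 21800, 644], name="Value")
mask = iqr_outlier_mask(value_sample)
print(value_sample[mask])
```

For distributions as heavy-tailed as these, the rule tends to flag a large share of rows, so the resulting counts are better read as a description of the tails than as a list of records to drop.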

#### **Conclusion**
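The dataset is complete but unevenly distributed: **Amount** and **Value** have heavy right tails and extreme outliers, **Fraud Result** is highly imbalanced, and **Country Code** is constant. These distribution characteristics require further investigation.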