# Exploratory Data Analysis
## Data Overview
**Goal**: Understand the structure of the dataset, including the number of rows, columns, and data types.

**Key Steps**: Load the data and print a concise summary of the dataset using .info() and .head().

In [1]:
# Import necessary libraries
import pandas as pd
import logging
import os, sys
# Add the 'scripts' directory to the Python path for module imports
sys.path.append(os.path.abspath(os.path.join('..', 'scripts')))
# Import load_data module
from load_data import load_data # type: ignore

# Set max rows and columns to display
pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 200)

# Configure logging
from custom_logger import setup_logger # type: ignore
logger = setup_logger()

logger.info("Imported necessary libraries.")

2025-01-23 15:57:22,294 - INFO - Imported necessary libraries.


Load the dataset

In [2]:
logger.info("Data loading initiated.")
df = load_data('../data/data.csv')  # Assume load_data() is your function
logger.info("Data loaded successfully.")

2025-01-23 15:57:25,132 - INFO - Data loading initiated.
2025-01-23 15:57:25,414 - INFO - Data loaded successfully.


Data successfully loaded from ../data/data.csv
Dataset contains 95662 rows and 15 columns.
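
The notebook does not show the body of `load_data()`. As a rough, hypothetical sketch only, a loader producing messages like the ones above could look like the following. The name `load_data_sketch`, the choice of `TransactionId` as index, and the date parsing are all assumptions, and the small in-memory sample stands in for `../data/data.csv`.

```python
import io
import pandas as pd

def load_data_sketch(source):
    """Hypothetical stand-in for load_data(); the real function may differ."""
    # Assumption: TransactionId is the index and TransactionStartTime holds ISO timestamps.
    df = pd.read_csv(
        source,
        index_col="TransactionId",
        parse_dates=["TransactionStartTime"],
    )
    print(f"Dataset contains {df.shape[0]} rows and {df.shape[1]} columns.")
    return df

# Tiny in-memory sample so the sketch runs without the real file
sample_csv = io.StringIO(
    "TransactionId,Amount,Value,TransactionStartTime\n"
    "TransactionId_76871,1000.0,1000,2018-11-15T02:18:49Z\n"
    "TransactionId_73770,-20.0,20,2018-11-15T02:19:08Z\n"
)
sample_df = load_data_sketch(sample_csv)
```

Parsing the timestamp at load time is one possible reason a column would already arrive as a datetime type rather than as plain strings.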



### 1. Data overview:
Provide an overview of the dataset including shape, data types, missing values, and first few rows.

In [9]:
# Import the class CreditRiskEDA
from credit_eda_analysis import CreditRiskEDA
# Initialize the class
cr_eda = CreditRiskEDA(df)
# Logging activity
logger.info("Data overview initiated.")

# Overview of the dataset
if not df.empty:
    cr_eda.data_overview()

logger.info("Data overview successfully completed.")

2025-01-23 16:01:53,052 - INFO - Data overview initiated.


Data Overview:
Number of rows: 95662
Number of columns: 15

Column Data Types:
BatchId                              object
AccountId                            object
SubscriptionId                       object
CustomerId                           object
CurrencyCode                         object
CountryCode                           int64
ProviderId                           object
ProductId                            object
ProductCategory                      object
ChannelId                            object
Amount                              float64
Value                                 int64
TransactionStartTime    datetime64[ns, UTC]
PricingStrategy                       int64
FraudResult                           int64
dtype: object

First Five Rows:


TransactionId,BatchId,AccountId,SubscriptionId,CustomerId,CurrencyCode,CountryCode,ProviderId,ProductId,ProductCategory,ChannelId,Amount,Value,TransactionStartTime,PricingStrategy,FraudResult
TransactionId_76871,BatchId_36123,AccountId_3957,SubscriptionId_887,CustomerId_4406,UGX,256,ProviderId_6,ProductId_10,airtime,ChannelId_3,1000.0,1000,2018-11-15 02:18:49+00:00,2,0
TransactionId_73770,BatchId_15642,AccountId_4841,SubscriptionId_3829,CustomerId_4406,UGX,256,ProviderId_4,ProductId_6,financial_services,ChannelId_2,-20.0,20,2018-11-15 02:19:08+00:00,2,0
TransactionId_26203,BatchId_53941,AccountId_4229,SubscriptionId_222,CustomerId_4683,UGX,256,ProviderId_6,ProductId_1,airtime,ChannelId_3,500.0,500,2018-11-15 02:44:21+00:00,2,0
TransactionId_380,BatchId_102363,AccountId_648,SubscriptionId_2185,CustomerId_988,UGX,256,ProviderId_1,ProductId_21,utility_bill,ChannelId_3,20000.0,21800,2018-11-15 03:32:55+00:00,2,0
TransactionId_28195,BatchId_38780,AccountId_4841,SubscriptionId_3829,CustomerId_988,UGX,256,ProviderId_4,ProductId_6,financial_services,ChannelId_2,-644.0,644,2018-11-15 03:34:21+00:00,2,0


2025-01-23 16:01:53,096 - INFO - Data overview successfully completed.



Missing Values Overview:
BatchId                 0
AccountId               0
SubscriptionId          0
CustomerId              0
CurrencyCode            0
CountryCode             0
ProviderId              0
ProductId               0
ProductCategory         0
ChannelId               0
Amount                  0
Value                   0
TransactionStartTime    0
PricingStrategy         0
FraudResult             0
dtype: int64


### Data Overview Summary

+ The dataset contains 95,662 rows and 15 columns, providing a substantial amount of data for analysis.

+ The data types are generally appropriate. TransactionStartTime is listed as datetime64[ns, UTC] in the output above; if it is instead read in as object (plain strings), it must be converted to datetime format before any time-based analysis.

+ No missing values were detected in any of the columns, ensuring the dataset is complete and ready for further analysis without the need for imputation.
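
The implementation of `CreditRiskEDA.data_overview()` is not shown in this notebook. A minimal sketch of the same checks with plain pandas, assuming nothing beyond a DataFrame, might be written as below. The function name `overview_sketch` and the toy frame are illustrative only.

```python
import pandas as pd

def overview_sketch(frame):
    """Collect shape, dtypes, and missing-value counts in one dictionary."""
    return {
        "n_rows": frame.shape[0],
        "n_cols": frame.shape[1],
        "dtypes": frame.dtypes.astype(str).to_dict(),
        "missing": frame.isna().sum().to_dict(),
    }

# Toy frame with one deliberately missing Amount (not taken from the dataset)
toy = pd.DataFrame({"Amount": [1000.0, -20.0, None], "Value": [1000, 20, 500]})
report = overview_sketch(toy)
print(report)
```

Returning a dictionary instead of printing makes it easy to assert on the result, for example to fail early when any missing-value count is non-zero.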

In [10]:
# Ensure TransactionStartTime is in datetime format (leaves an already-converted column unchanged)
df['TransactionStartTime'] = pd.to_datetime(df['TransactionStartTime'])
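
If the timestamps could contain malformed entries, a slightly more defensive variant of this conversion is possible. The sketch below uses toy data and assumes that UTC is the intended time zone and that unparseable values should become `NaT` rather than raise an error.

```python
import pandas as pd

# Toy column: one valid ISO timestamp and one deliberately invalid value
raw = pd.DataFrame({"TransactionStartTime": ["2018-11-15T02:18:49Z", "not a date"]})

# utc=True yields a timezone-aware column; errors="coerce" turns unparseable values into NaT
raw["TransactionStartTime"] = pd.to_datetime(
    raw["TransactionStartTime"], utc=True, errors="coerce"
)

n_failed = raw["TransactionStartTime"].isna().sum()
print(raw["TransactionStartTime"].dtype, "| unparseable:", n_failed)
```

Counting the `NaT` values afterwards shows how many rows failed to parse, which is worth checking before relying on time-based features.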

### 2. Summary Statistics

Understand the central tendency, dispersion, and shape of the dataset’s distribution.

In [11]:
# Log the start of the summary statistics process
logger.info("Generating summary statistics for numeric columns...")   

# Statistical summary
summary_stats = cr_eda.summary_statistics()
display(summary_stats.T)
# Log completion
logger.info("Summary statistics generation completed.")

2025-01-23 16:02:01,302 - INFO - Generating summary statistics for numeric columns...


Statistic,CountryCode,Amount,Value,PricingStrategy,FraudResult
count,95662.0,95662.0,95662.0,95662.0,95662.0
mean,256.0,6717.846,9900.584,2.255974,0.002018
std,0.0,123306.8,123122.1,0.732924,0.044872
min,256.0,-1000000.0,2.0,0.0,0.0
25%,256.0,-50.0,275.0,2.0,0.0
50%,256.0,1000.0,1000.0,2.0,0.0
75%,256.0,2800.0,5000.0,2.0,0.0
max,256.0,9880000.0,9880000.0,4.0,1.0
median,256.0,1000.0,1000.0,2.0,0.0
mode,256.0,1000.0,1000.0,2.0,0.0


2025-01-23 16:02:01,360 - INFO - Summary statistics generation completed.


### Observations:

+ The CountryCode is constant at 256, indicating no variability in this column.

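+ The Amount and Value features are highly skewed (skewness above 51 for both): their means (about 6,718 and 9,901) lie far above their shared median of 1,000, and both reach a maximum of 9,880,000, which points to extreme outliers or heavy-tailed distributions.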

+ PricingStrategy is fairly stable: its 25th, 50th, and 75th percentiles all equal 2, within an overall range of 0 to 4.

+ FraudResult has a mean of about 0.002, so roughly 0.2% of transactions are flagged as fraudulent. The column is almost entirely zeros, which makes it strongly positively skewed.
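
The skewness figures quoted above do not appear in the summary table, and the notebook does not show how they were computed. One common way to obtain such values, sketched here on toy data rather than the real dataset, is `DataFrame.skew()`; the mean of a 0/1 column gives the fraud rate directly.

```python
import pandas as pd

# Toy values for illustration only; they are not the dataset's statistics
toy = pd.DataFrame({
    "Amount": [1000.0, -20.0, 500.0, 20000.0, -644.0],
    "FraudResult": [0, 0, 0, 1, 0],
})

# Sample skewness per numeric column; positive values mean a long right tail
skewness = toy.skew(numeric_only=True)

# For a 0/1 indicator, the mean equals the share of positive cases
fraud_rate = toy["FraudResult"].mean()

print(skewness)
print(f"Fraud rate: {fraud_rate:.1%}")
```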

### Distribution of Numerical Features

+ Visualize the distribution of numerical features to identify patterns, skewness, and potential outliers.

In [6]:
# Import the class CreditRiskEDAVisualize
from credit_eda_visualize import CreditRiskEDAVisualize
# Initialize the class
cr_eda_visual = CreditRiskEDAVisualize(df)
# Logger activity
logger.info("Plotting numerical distributions...")
# List of numeric columns
numeric_cols = df.select_dtypes(include='number').columns

# Plot distribution
cr_eda_visual.plot_numerical_distribution(numeric_cols)
logger.info("The distribution plot successfully completed.")

ModuleNotFoundError: No module named 'matplotlib'
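
The error above indicates that matplotlib is not installed in the environment running the notebook. Until it is installed, the distributions can still be inspected numerically. The sketch below is a stopgap that assumes only numpy and pandas are available; `text_histogram` is a hypothetical helper, not part of `CreditRiskEDAVisualize`.

```python
import importlib.util
import numpy as np
import pandas as pd

def text_histogram(series, bins=5):
    """Return (counts, bin_edges) for a numeric series without any plotting backend."""
    counts, edges = np.histogram(series.dropna().to_numpy(), bins=bins)
    return counts, edges

# Report the missing dependency up front instead of failing inside the plotting call
if importlib.util.find_spec("matplotlib") is None:
    print("matplotlib is not installed; install it (for example with pip) to render the plots.")

# Toy values for illustration only
toy = pd.Series([1000.0, -20.0, 500.0, 20000.0, -644.0], name="Amount")
counts, edges = text_histogram(toy, bins=4)
print(dict(zip(np.round(edges[:-1], 1), counts)))
```

Even in this tiny example most values fall into the first bin while a single large value occupies the last one, the same right-skewed pattern described in the observations above.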