
# Task
Create an advanced data cleaning and preprocessing system for South African Credit Risk Rating Models.

## Data loading and initial exploration

### Subtask:
Load the credit risk data and perform initial exploration to understand its structure, identify data types, and get a sense of data quality. This includes checking for missing values, inconsistencies, and potential outliers.


**Reasoning**:
Load the dataset and perform initial exploration steps as requested in the instructions.



In [1]:
import pandas as pd

# 1. Load the credit risk dataset
try:
    df = pd.read_csv('south_african_credit_risk.csv')
except FileNotFoundError:
    print("Error: 'south_african_credit_risk.csv' not found. Please make sure the file is in the correct directory.")
    # Assuming the task cannot proceed without the data file
    raise

# 2. Display the first 5 rows
print("First 5 rows of the DataFrame:")
display(df.head())

# 3. Print the data types of each column
print("\nData types of each column:")
print(df.dtypes)

# 4. Display the dimensions of the DataFrame
print("\nDimensions of the DataFrame (rows, columns):")
print(df.shape)

# 5. Use the .info() method
print("\nDataFrame Info:")
df.info()

# 6. Use the .describe() method
print("\nDescriptive statistics for numerical columns:")
display(df.describe())

# 7. Check for missing values
print("\nMissing values per column:")
print(df.isnull().sum())

# 8. Identify and display unique values for a few key categorical columns
# Assuming 'Gender', 'Marital Status', and 'Education' are key categorical columns.
# Replace with actual column names if different.
categorical_cols = ['Gender', 'Marital Status', 'Education']
print("\nUnique values and counts for key categorical columns:")
for col in categorical_cols:
    if col in df.columns:
        print(f"\nColumn: {col}")
        print(df[col].value_counts())
    else:
        print(f"\nColumn '{col}' not found in the DataFrame.")

Error: 'south_african_credit_risk.csv' not found. Please make sure the file is in the correct directory.


FileNotFoundError: [Errno 2] No such file or directory: 'south_african_credit_risk.csv'

## Data loading and initial exploration retry

### Subtask:
Retry loading the credit risk data and performing initial exploration, ensuring the dataset file is accessible. This includes checking for missing values, inconsistencies, and potential outliers.


**Reasoning**:
Retry loading the data and performing initial exploration as instructed, assuming the file is now accessible.



In [2]:
# 1. Load the credit risk dataset, assuming the file is now accessible
try:
    df = pd.read_csv('south_african_credit_risk.csv')
except FileNotFoundError:
    print("Error: 'south_african_credit_risk.csv' not found. Please make sure the file is in the correct directory.")
    # If file is still not found, raise the error to signal failure
    raise

# 2. Display the first few rows
print("First 5 rows of the DataFrame:")
display(df.head())

# 3. Print the data types of each column
print("\nData types of each column:")
print(df.dtypes)

# 4. Display the dimensions of the DataFrame
print("\nDimensions of the DataFrame (rows, columns):")
print(df.shape)

# 5. Use the .info() method
print("\nDataFrame Info:")
df.info()

# 6. Use the .describe() method
print("\nDescriptive statistics for numerical columns:")
display(df.describe())

# 7. Check for missing values
print("\nMissing values per column:")
print(df.isnull().sum())

# 8. Identify and display unique values for a few key categorical columns
# Assuming 'Gender', 'Marital Status', and 'Education' are key categorical columns based on the previous attempt.
# Replace with actual column names if different in the loaded data.
categorical_cols = ['Gender', 'Marital Status', 'Education']
print("\nUnique values and counts for key categorical columns:")
for col in categorical_cols:
    if col in df.columns:
        print(f"\nColumn: {col}")
        print(df[col].value_counts())
    else:
        print(f"\nColumn '{col}' not found in the DataFrame.")

Error: 'south_african_credit_risk.csv' not found. Please make sure the file is in the correct directory.


FileNotFoundError: [Errno 2] No such file or directory: 'south_african_credit_risk.csv'
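
Both load attempts fail for the same reason: the CSV is not in the notebook's working directory. The sketch below shows one way to check for the file before reading it. The helper name `load_credit_data` is hypothetical and not part of the notebook; only the file name `south_african_credit_risk.csv` is taken from the cells above. On failure it reports the working directory and the CSV files actually present, which usually makes the cause easier to spot.

```python
from pathlib import Path

import pandas as pd


def load_credit_data(path="south_african_credit_risk.csv"):
    """Load the CSV if it exists; otherwise raise with a diagnostic message.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    csv_path = Path(path)
    if not csv_path.is_file():
        # List the CSV files that are actually present in the target folder
        folder = csv_path.parent if csv_path.parent.is_dir() else Path.cwd()
        available = sorted(p.name for p in folder.glob("*.csv"))
        raise FileNotFoundError(
            f"'{csv_path}' not found (working directory: {Path.cwd()}). "
            f"CSV files in '{folder}': {available or 'none'}"
        )
    return pd.read_csv(csv_path)
```

Calling `df = load_credit_data()` once at the top of the notebook would also remove the need to repeat the `try`/`except` block in every later cell.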

## Handling missing values

### Subtask:
Develop strategies to handle missing data, considering the impact on credit risk modeling. This might involve imputation techniques appropriate for financial data or domain-specific rules for handling missing credit-related information.


**Reasoning**:
Analyze the missing value counts and implement strategies to handle them based on the data type and domain knowledge relevant to credit risk.



In [3]:
# Analyze missing values
print("Missing values before handling:")
print(df.isnull().sum())

# Identify columns with missing values
missing_cols = df.columns[df.isnull().any()].tolist()
print(f"\nColumns with missing values: {missing_cols}")

# Define strategies for handling missing values based on data type and domain knowledge
# This is a placeholder and should be adapted based on actual data analysis and domain expertise.
# For example, 'Loan Amount' might be imputed with median, 'Credit Score' with a specific value like 0 or median,
# and categorical columns with mode or 'Unknown'.

# Example strategies:
# Impute numerical columns with median (robust to outliers)
numerical_cols_to_impute_median = ['Loan Amount', 'Income'] # Replace with actual numerical columns needing imputation
for col in numerical_cols_to_impute_median:
    if col in df.columns and df[col].isnull().any():
        median_val = df[col].median()
        df[col] = df[col].fillna(median_val)  # assign back; inplace fillna on a column selection is unreliable
        print(f"Imputed missing values in '{col}' with median ({median_val}).")

# Impute numerical columns with a specific value (e.g., 0 for financial metrics if missing indicates absence)
numerical_cols_to_impute_zero = ['Number of Defaults'] # Replace with actual numerical columns needing imputation with 0
for col in numerical_cols_to_impute_zero:
    if col in df.columns and df[col].isnull().any():
        df[col] = df[col].fillna(0)
        print(f"Imputed missing values in '{col}' with 0.")

# Impute categorical columns with mode
categorical_cols_to_impute_mode = ['Education', 'Marital Status'] # Replace with actual categorical columns needing imputation
for col in categorical_cols_to_impute_mode:
    if col in df.columns and df[col].isnull().any():
        mode_val = df[col].mode()[0] # mode() can return multiple values, take the first
        df[col] = df[col].fillna(mode_val)
        print(f"Imputed missing values in '{col}' with mode ({mode_val}).")

# Alternatively, impute categorical columns with 'Unknown'
categorical_cols_to_impute_unknown = ['Employment Type'] # Replace with actual categorical columns needing imputation with 'Unknown'
for col in categorical_cols_to_impute_unknown:
    if col in df.columns and df[col].isnull().any():
        df[col] = df[col].fillna('Unknown')
        print(f"Imputed missing values in '{col}' with 'Unknown'.")


# Decide whether to drop columns with a high percentage of missing values
# Example: drop columns with more than 50% missing values
threshold = 0.5
cols_to_drop = [col for col in df.columns if df[col].isnull().sum() / len(df) > threshold]
if cols_to_drop:
    print(f"\nDropping columns with more than {threshold*100}% missing values: {cols_to_drop}")
    df.drop(columns=cols_to_drop, inplace=True)

# Verify that missing values have been handled
print("\nMissing values after handling:")
print(df.isnull().sum())

Missing values before handling:


NameError: name 'df' is not defined

**Reasoning**:
The previous command failed because the dataframe `df` was not defined. This is likely due to the failure in the previous subtask to load the data. I need to load the data again before proceeding with missing value handling.



In [4]:
# Load the credit risk dataset, assuming the file is now accessible
try:
    df = pd.read_csv('south_african_credit_risk.csv')
    print("Dataset loaded successfully.")
except FileNotFoundError:
    print("Error: 'south_african_credit_risk.csv' not found. Please make sure the file is in the correct directory.")
    # If file is still not found, raise the error to signal failure
    raise

# Analyze missing values
print("\nMissing values before handling:")
print(df.isnull().sum())

# Identify columns with missing values
missing_cols = df.columns[df.isnull().any()].tolist()
print(f"\nColumns with missing values: {missing_cols}")

# Define strategies for handling missing values based on data type and domain knowledge
# This is a placeholder and should be adapted based on actual data analysis and domain expertise.
# For example, 'Loan Amount' might be imputed with median, 'Credit Score' with a specific value like 0 or median,
# and categorical columns with mode or 'Unknown'.

# Example strategies:
# Impute numerical columns with median (robust to outliers)
numerical_cols_to_impute_median = ['Loan Amount', 'Income'] # Replace with actual numerical columns needing imputation
for col in numerical_cols_to_impute_median:
    if col in df.columns and df[col].isnull().any():
        median_val = df[col].median()
        df[col] = df[col].fillna(median_val)  # assign back; inplace fillna on a column selection is unreliable
        print(f"Imputed missing values in '{col}' with median ({median_val}).")

# Impute numerical columns with a specific value (e.g., 0 for financial metrics if missing indicates absence)
numerical_cols_to_impute_zero = ['Number of Defaults'] # Replace with actual numerical columns needing imputation with 0
for col in numerical_cols_to_impute_zero:
    if col in df.columns and df[col].isnull().any():
        df[col] = df[col].fillna(0)
        print(f"Imputed missing values in '{col}' with 0.")

# Impute categorical columns with mode
categorical_cols_to_impute_mode = ['Education', 'Marital Status'] # Replace with actual categorical columns needing imputation
for col in categorical_cols_to_impute_mode:
    if col in df.columns and df[col].isnull().any():
        mode_val = df[col].mode()[0] # mode() can return multiple values, take the first
        df[col] = df[col].fillna(mode_val)
        print(f"Imputed missing values in '{col}' with mode ({mode_val}).")

# Alternatively, impute categorical columns with 'Unknown'
categorical_cols_to_impute_unknown = ['Employment Type'] # Replace with actual categorical columns needing imputation with 'Unknown'
for col in categorical_cols_to_impute_unknown:
    if col in df.columns and df[col].isnull().any():
        df[col] = df[col].fillna('Unknown')
        print(f"Imputed missing values in '{col}' with 'Unknown'.")


# Decide whether to drop columns with a high percentage of missing values
# Example: drop columns with more than 50% missing values
threshold = 0.5
cols_to_drop = [col for col in df.columns if df[col].isnull().sum() / len(df) > threshold]
if cols_to_drop:
    print(f"\nDropping columns with more than {threshold*100}% missing values: {cols_to_drop}")
    df.drop(columns=cols_to_drop, inplace=True)

# Verify that missing values have been handled
print("\nMissing values after handling:")
print(df.isnull().sum())

Error: 'south_african_credit_risk.csv' not found. Please make sure the file is in the correct directory.


FileNotFoundError: [Errno 2] No such file or directory: 'south_african_credit_risk.csv'
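
Since the cell above stops at the load step, the imputation strategies never run. The sketch below applies the same three strategies (median, zero, and `'Unknown'`) to a tiny made-up frame, and also adds a missing-value indicator per column so that the fact a value was missing is not lost. The data, the helper name `impute_with_flags`, and the `_missing` suffix are illustrative assumptions; the column names are the notebook's own placeholders.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame; column names follow the placeholders used in the notebook
toy = pd.DataFrame({
    "Income": [12000.0, np.nan, 8500.0, 30000.0],
    "Number of Defaults": [0.0, 2.0, np.nan, 1.0],
    "Employment Type": ["Permanent", None, "Contract", "Permanent"],
})


def impute_with_flags(frame, median_cols, zero_cols, unknown_cols):
    """Impute by column group and add a <col>_missing indicator for each column."""
    out = frame.copy()
    for col in median_cols + zero_cols + unknown_cols:
        # Record missingness before filling so it can still be used as a signal
        out[f"{col}_missing"] = out[col].isna().astype(int)
    for col in median_cols:
        out[col] = out[col].fillna(out[col].median())
    for col in zero_cols:
        out[col] = out[col].fillna(0)
    for col in unknown_cols:
        out[col] = out[col].fillna("Unknown")
    return out


imputed = impute_with_flags(toy, ["Income"], ["Number of Defaults"], ["Employment Type"])
print(imputed)
```

Working on a copy leaves the original frame untouched, which makes it easier to compare the data before and after imputation.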

## Addressing outliers

### Subtask:
Identify and address outliers in financial features that could disproportionately influence credit risk models. Techniques might include capping, flooring, or transformations, while considering regulatory requirements and domain expertise.


**Reasoning**:
The dataset needs to be loaded before any outlier detection and handling can be performed. The previous attempts failed because the file was not found, so I will try loading it again.



In [5]:
# Load the credit risk dataset, assuming the file is now accessible
try:
    df = pd.read_csv('south_african_credit_risk.csv')
    print("Dataset loaded successfully.")
except FileNotFoundError:
    print("Error: 'south_african_credit_risk.csv' not found. Please make sure the file is in the correct directory.")
    # If file is still not found, raise the error to signal failure
    raise

# 2. Identify potential financial features with outliers
# Based on common financial datasets and the previous describe() output (if available),
# selecting numerical columns likely to contain outliers.
financial_features = ['Loan Amount', 'Income', 'Credit Score', 'Age'] # Add or remove based on actual data columns and domain knowledge

# 3. Calculate IQR and bounds, and implement outlier handling (capping/flooring)
for feature in financial_features:
    if feature in df.columns:
        print(f"\nProcessing feature: {feature}")
        # Calculate Q1, Q3, and IQR
        Q1 = df[feature].quantile(0.25)
        Q3 = df[feature].quantile(0.75)
        IQR = Q3 - Q1

        # Define outlier bounds (1.5 * IQR rule)
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR

        print(f"  Q1: {Q1}, Q3: {Q3}, IQR: {IQR}")
        print(f"  Lower Bound (1.5*IQR): {lower_bound}")
        print(f"  Upper Bound (1.5*IQR): {upper_bound}")

        # Implement capping and flooring in one vectorized step:
        # values above the upper bound are capped, values below the lower bound are floored
        df[feature] = df[feature].clip(lower=lower_bound, upper=upper_bound)

        print(f"  Outliers in '{feature}' capped at {upper_bound} and floored at {lower_bound}.")

    else:
        print(f"\nWarning: Feature '{feature}' not found in the DataFrame.")

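# 5. Re-examine descriptive statistics for modified columns
# Only describe the features that exist in the DataFrame, to avoid a KeyError on missing columns
present_features = [feature for feature in financial_features if feature in df.columns]
print("\nDescriptive statistics for financial features after outlier handling:")
display(df[present_features].describe())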

Error: 'south_african_credit_risk.csv' not found. Please make sure the file is in the correct directory.


FileNotFoundError: [Errno 2] No such file or directory: 'south_african_credit_risk.csv'
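
As an illustration of the 1.5 × IQR rule used in the cell above, the sketch below runs it on a short made-up series with one obvious outlier. The values and the helper name `iqr_bounds` are assumptions for demonstration only.

```python
import pandas as pd

# Made-up values with one obvious outlier
loan_amount = pd.Series([10.0, 11.0, 12.0, 12.0, 13.0, 100.0], name="Loan Amount")


def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) bounds of the k * IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


lower, upper = iqr_bounds(loan_amount)
# clip() floors values below `lower` and caps values above `upper`
capped = loan_amount.clip(lower=lower, upper=upper)
n_changed = int((capped != loan_amount).sum())
print(f"bounds=({lower}, {upper}), values changed={n_changed}")
```

Counting how many values were changed is a cheap sanity check: if a large share of a feature is being capped, the rule is probably too aggressive for that feature.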

## Feature engineering

### Subtask:
Create relevant features from existing data that are known to be predictive of credit risk in the South African context. This could involve creating ratios, interaction terms, or time-based features.


**Reasoning**:
The previous attempts failed because the dataset was not found. The first step is to load the dataset, and if successful, proceed with creating the requested features and displaying the results.



In [6]:
import pandas as pd
import numpy as np

# 1. Load the credit risk dataset
try:
    df = pd.read_csv('south_african_credit_risk.csv')
    print("Dataset loaded successfully.")
except FileNotFoundError:
    print("Error: 'south_african_credit_risk.csv' not found. Please make sure the file is in the correct directory.")
    # If file is still not found, raise the error to signal failure
    raise

# Identify relevant numerical columns for feature creation
# Based on common credit risk factors, assuming these columns exist:
# 'Loan Amount', 'Income', 'Credit Score', 'Number of Defaults', 'Credit History Length (Months)'
# Adjust column names based on actual dataset if necessary.
loan_amount_col = 'Loan Amount'
income_col = 'Income'
credit_score_col = 'Credit Score'
defaults_col = 'Number of Defaults' # Assuming a column indicating number of past defaults
credit_history_col = 'Credit History Length (Months)' # Assuming credit history in months
application_date_col = 'Application Date' # Assuming a column with application date

# Ensure required columns exist before proceeding
required_cols = [loan_amount_col, income_col, credit_score_col]
if credit_history_col in df.columns:
    required_cols.append(credit_history_col)
if application_date_col in df.columns:
    required_cols.append(application_date_col)

for col in required_cols:
    if col not in df.columns:
        print(f"Warning: Required column '{col}' not found in the dataset. Skipping related feature engineering.")

# 3. Create debt-to-income ratio
if loan_amount_col in df.columns and income_col in df.columns:
    # Treat zero income as missing so the ratio becomes NaN instead of infinite.
    # (Adding a small epsilon to the denominator would instead produce absurdly large ratios for zero-income rows.)
    safe_income = df[income_col].replace(0, np.nan)
    df['Debt_to_Income_Ratio'] = df[loan_amount_col] / safe_income
    # Guard against any remaining infinite values, then impute NaNs with the median of the calculated ratio
    df['Debt_to_Income_Ratio'] = df['Debt_to_Income_Ratio'].replace([np.inf, -np.inf], np.nan)
    median_dti = df['Debt_to_Income_Ratio'].median()
    df['Debt_to_Income_Ratio'] = df['Debt_to_Income_Ratio'].fillna(median_dti)
    print("Created 'Debt_to_Income_Ratio' feature.")
else:
    print("Cannot create 'Debt_to_Income_Ratio': Required columns not found.")


# 4. Create interaction terms (e.g., Credit Score * Income)
if credit_score_col in df.columns and income_col in df.columns:
    df['Credit_Score_x_Income'] = df[credit_score_col] * df[income_col]
    print("Created 'Credit_Score_x_Income' interaction term.")
else:
    print("Cannot create 'Credit_Score_x_Income': Required columns not found.")


# 5. Create time-based features (e.g., age of credit history)
if credit_history_col in df.columns:
    # Assuming 'Credit History Length (Months)' is in months
    df['Credit_History_Length_Years'] = df[credit_history_col] / 12
    print("Created 'Credit_History_Length_Years' feature.")
elif application_date_col in df.columns:
    try:
        # Assuming 'Application Date' is in a convertible format
        df[application_date_col] = pd.to_datetime(df[application_date_col])
        # No credit history column is available, so fall back to the time elapsed since application.
        # Note: this is only a proxy; it measures days since application, not true credit history length.
        # For demonstration, use a fixed evaluation date 30 days after the latest application in the data.
        # In a real scenario, this would be the model's evaluation date.
        latest_app_date = df[application_date_col].max()
        evaluation_date = latest_app_date + pd.Timedelta(days=30) # Example: 30 days after the last application
        df['Credit_History_Length_Days'] = (evaluation_date - df[application_date_col]).dt.days
        print(f"Created 'Credit_History_Length_Days' based on '{application_date_col}'.")
    except Exception as e:
        print(f"Could not create time-based features from '{application_date_col}': {e}")
else:
    print("Cannot create time-based features: No relevant date or credit history length column found.")


# 6. The new features are already added to the DataFrame in the steps above.

# 7. Display the first few rows of the DataFrame to show the newly created features
print("\nDataFrame with new features:")
display(df.head())

# 8. Print the .info() and .describe() of the DataFrame to understand the structure and basic statistics of the new features
print("\nDataFrame Info after feature engineering:")
df.info()

print("\nDescriptive statistics of the DataFrame after feature engineering:")
display(df.describe())

Error: 'south_african_credit_risk.csv' not found. Please make sure the file is in the correct directory.


FileNotFoundError: [Errno 2] No such file or directory: 'south_african_credit_risk.csv'
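
The ratio feature is sensitive to how zero income is handled. The sketch below, on made-up numbers, treats zero income as missing before dividing; dividing by a tiny epsilon instead would turn a 20,000 loan with zero income into a ratio in the billions. The `DTI_was_missing` flag is a hypothetical addition, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Made-up rows; the second applicant reports zero income
toy = pd.DataFrame({
    "Loan Amount": [50000.0, 20000.0, 10000.0],
    "Income": [25000.0, 0.0, 5000.0],
})

# Zero income becomes NaN, so the division yields NaN instead of inf
income = toy["Income"].replace(0, np.nan)
toy["Debt_to_Income_Ratio"] = toy["Loan Amount"] / income

# Keep a flag so imputed ratios can be told apart from observed ones
toy["DTI_was_missing"] = toy["Debt_to_Income_Ratio"].isna().astype(int)

# Impute the undefined ratio with the median of the observed ratios
toy["Debt_to_Income_Ratio"] = toy["Debt_to_Income_Ratio"].fillna(
    toy["Debt_to_Income_Ratio"].median()
)
print(toy)
```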

## Encoding categorical variables

### Subtask:
Convert categorical features, such as credit history categories or demographic information, into a numerical format suitable for modeling, using appropriate encoding techniques.


**Reasoning**:
The dataset was not found in the previous attempts. I need to load the dataset first before I can identify and encode categorical features.



In [7]:
import pandas as pd

# Load the credit risk dataset, assuming the file is now accessible
try:
    df = pd.read_csv('south_african_credit_risk.csv')
    print("Dataset loaded successfully.")
except FileNotFoundError:
    print("Error: 'south_african_credit_risk.csv' not found. Please make sure the file is in the correct directory.")
    # If file is still not found, raise the error to signal failure
    raise

# 1. Identify categorical columns
categorical_cols = df.select_dtypes(include='object').columns
print(f"\nIdentified categorical columns: {list(categorical_cols)}")

# 2. & 3. Decide and implement encoding techniques
# Assuming 'Gender' and 'Marital Status' are nominal, and 'Education' is ordinal
# For 'Education', we need to define the order. Example order: 'Primary', 'Secondary', 'Tertiary'
# This order is an assumption and should be based on domain knowledge.
nominal_cols = ['Gender', 'Marital Status'] # Add other nominal columns if any
ordinal_cols = ['Education'] # Add other ordinal columns if any

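# One-Hot Encoding for nominal columns
# Keep only the nominal columns that exist, since get_dummies raises a KeyError for missing ones
nominal_cols = [col for col in nominal_cols if col in df.columns]
if nominal_cols:
    print(f"\nApplying One-Hot Encoding to: {nominal_cols}")
    df = pd.get_dummies(df, columns=nominal_cols, drop_first=True) # drop_first=True to avoid multicollinearity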
    print("One-Hot Encoding applied.")
else:
    print("\nNo nominal columns to One-Hot Encode.")


# Ordinal Encoding for ordinal columns
if 'Education' in ordinal_cols and 'Education' in df.columns:
    print(f"\nApplying Ordinal Encoding to: {list(ordinal_cols)}")
    # Define the order for 'Education'
    education_order = ['Primary', 'Secondary', 'Tertiary'] # Example order
    # Check if all non-missing categories in the column are in the defined order
    # (missing values are ignored here and remain NaN after mapping)
    if all(item in education_order for item in df['Education'].dropna().unique()):
        # Create a mapping dictionary
        education_mapping = {category: index for index, category in enumerate(education_order)}
        # Apply mapping
        df['Education_Encoded'] = df['Education'].map(education_mapping)
        print(f"Ordinal Encoding applied to 'Education' with order: {education_order}")
        # Drop the original column after encoding
        df.drop(columns=['Education'], inplace=True)
    else:
        print(f"Warning: Categories in 'Education' do not match the defined order {education_order}. Skipping Ordinal Encoding for 'Education'.")
else:
    print("\nNo ordinal columns to Ordinal Encode, or 'Education' is not present in the DataFrame.")

# 4. Original columns are dropped in the encoding steps above

# 5. Display the first few rows and check the .info()
print("\nDataFrame after encoding:")
display(df.head())

print("\nDataFrame Info after encoding:")
df.info()

Error: 'south_african_credit_risk.csv' not found. Please make sure the file is in the correct directory.


FileNotFoundError: [Errno 2] No such file or directory: 'south_african_credit_risk.csv'
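
The two encodings described above can be sketched on a small made-up frame. This version uses an ordered `pd.Categorical` for the ordinal column instead of a mapping dictionary, which is an alternative to the notebook's approach rather than a copy of it. The category order is the same assumed example order as in the cell above.

```python
import pandas as pd

# Made-up rows; the last applicant has no education recorded
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male"],
    "Education": ["Secondary", "Primary", "Tertiary", None],
})

education_order = ["Primary", "Secondary", "Tertiary"]  # assumed order, as in the cell above

# Ordered categorical: codes follow the declared order; missing or unlisted values get code -1
education = pd.Categorical(toy["Education"], categories=education_order, ordered=True)
toy["Education_Encoded"] = education.codes

# One-hot encode the nominal column, dropping the first level to avoid redundancy
encoded = pd.get_dummies(toy.drop(columns=["Education"]), columns=["Gender"], drop_first=True)
print(encoded)
```

The `-1` code makes missing or unexpected categories explicit, so they can be imputed or flagged deliberately instead of silently becoming NaN.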

## Data transformation and scaling

### Subtask:
Apply transformations (e.g., logarithmic) to handle skewed distributions and scale features to a similar range, which is important for many credit risk modeling algorithms.


**Reasoning**:
The dataset needs to be loaded first to perform any transformations and scaling. The previous attempts failed because the file was not found. This code block attempts to load the data, identifies numerical columns, applies log transformation after handling zeros or negative values, identifies columns for scaling, and applies standard scaling. Finally, it displays the head, info, and describe of the transformed and scaled dataframe.



In [8]:
import numpy as np
from sklearn.preprocessing import StandardScaler

# 1. Load the credit risk dataset, assuming the file is now accessible.
try:
    df = pd.read_csv('south_african_credit_risk.csv')
    print("Dataset loaded successfully.")
except FileNotFoundError:
    print("Error: 'south_african_credit_risk.csv' not found. Please make sure the file is in the correct directory.")
    # Handle the FileNotFoundError
    raise

# 2. Identify numerical columns in the DataFrame that might benefit from transformation
# These are typically columns with skewed distributions, like income or loan amount.
# We'll select numerical columns and then decide which ones to transform based on potential skewness.
numerical_cols = df.select_dtypes(include=np.number).columns.tolist()
print(f"\nIdentified numerical columns: {numerical_cols}")

# Assuming 'Income' and 'Loan Amount' might be skewed and benefit from log transformation.
# Add or remove columns based on actual data analysis and domain knowledge.
cols_to_transform = ['Income', 'Loan Amount'] # Example columns

# 3. Apply a logarithmic transformation to selected numerical columns.
for col in cols_to_transform:
    if col in df.columns:
        # np.log is undefined at zero and np.log1p is undefined at or below -1, so handle both cases explicitly.
        # Negative amounts are treated as missing (NaN) rather than transformed.
        if (df[col] < 0).any():
            print(f"Warning: Column '{col}' contains negative values. They are set to NaN in '{col}_log'.")
        if (df[col] <= 0).any():
            print(f"Column '{col}' contains zero or negative values. Applying log1p (log(1+x)) transformation.")
            # log1p handles 0 gracefully (log1p(0) = 0); where() masks negative values as NaN
            df[f'{col}_log'] = np.log1p(df[col].where(df[col] >= 0))
            # Drop the original column if you prefer to use the transformed one
            # df = df.drop(columns=[col]) # Uncomment if you want to drop original
        else:
            print(f"Applying log transformation to '{col}'.")
            df[f'{col}_log'] = np.log(df[col])
            # Drop the original column if you prefer to use the transformed one
            # df = df.drop(columns=[col]) # Uncomment if you want to drop original
    else:
        print(f"Warning: Column '{col}' not found in the DataFrame. Skipping transformation.")


# 4. Identify numerical columns that need scaling.
# This will include the transformed columns and other numerical columns that were not transformed,
# excluding the target variable if it's numerical.
# Assuming 'Credit Risk' or similar is the target variable and should not be scaled.
# Replace 'Credit Risk' below with the actual name of your target variable column.
target_variable = 'Credit Risk' # Example target variable column name

# Exclude the target variable and the original columns that were transformed (if still present)
cols_to_scale = [col for col in numerical_cols if col not in cols_to_transform and col != target_variable]
# Add the newly created log-transformed columns to the list of columns to scale
cols_to_scale.extend([f'{col}_log' for col in cols_to_transform if f'{col}_log' in df.columns])


print(f"\nIdentified numerical columns for scaling: {cols_to_scale}")

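# 5. Apply a standard scaler to the identified numerical columns.
# Note: fitting on the full dataset is for demonstration only. In a modeling pipeline,
# fit the scaler on the training split and reuse it to transform the test split.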
if cols_to_scale:
    print(f"\nApplying StandardScaler to: {cols_to_scale}")
    scaler = StandardScaler()
    df[cols_to_scale] = scaler.fit_transform(df[cols_to_scale])
    print("StandardScaler applied.")
else:
    print("\nNo numerical columns identified for scaling.")


# 6. Display the first few rows of the DataFrame and check the .info() and .describe()
print("\nDataFrame after transformations and scaling:")
display(df.head())

print("\nDataFrame Info after transformations and scaling:")
df.info()

print("\nDescriptive statistics of the DataFrame after transformations and scaling:")
display(df.describe())

Error: 'south_african_credit_risk.csv' not found. Please make sure the file is in the correct directory.


FileNotFoundError: [Errno 2] No such file or directory: 'south_african_credit_risk.csv'
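
The sketch below illustrates the transform-then-scale order on a made-up income series, and fits the scaling parameters on the training rows only so that no information from the held-out rows leaks into them. It standardizes by hand with pandas rather than calling `StandardScaler`, to keep the example dependency-free; the split and the values are assumptions.

```python
import numpy as np
import pandas as pd

# Made-up, right-skewed incomes including a zero
income = pd.Series([0.0, 5000.0, 12000.0, 30000.0, 250000.0, 8000.0], name="Income")

# log1p handles zero income (log1p(0) == 0) and compresses the long right tail
income_log = np.log1p(income)

# Hypothetical split: first four rows are "training", last two are "test"
train, test = income_log.iloc[:4], income_log.iloc[4:]

# Fit the scaling parameters on the training rows only
mean = train.mean()
std = train.std(ddof=0)  # population std, the convention StandardScaler uses

# Apply the same parameters to both splits
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std
print(train_scaled.round(3).tolist(), test_scaled.round(3).tolist())
```

With scikit-learn the equivalent pattern is `scaler.fit(X_train)` followed by `scaler.transform(...)` on each split.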

## Handling imbalanced data

### Subtask:
Credit risk datasets often have a class imbalance (more good loans than bad loans). Develop strategies to address this, such as oversampling, undersampling, or using techniques like SMOTE.


**Reasoning**:
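Load the credit risk dataset (handling `FileNotFoundError` if the file is missing), confirm the class imbalance in the target variable, and apply SMOTE to oversample the minority class.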



In [9]:
import pandas as pd
from collections import Counter
from imblearn.over_sampling import SMOTE
import warnings

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

# 1. Load the credit risk dataset. Handle the FileNotFoundError if the file is not found.
try:
    df = pd.read_csv('south_african_credit_risk.csv')
    print("Dataset loaded successfully.")
except FileNotFoundError:
    print("Error: 'south_african_credit_risk.csv' not found. Please make sure the file is in the correct directory.")
    # If file is still not found, raise the error to signal failure
    raise

# Assuming the target variable column is named 'Credit Risk'. Replace if necessary.
TARGET_VARIABLE = 'Credit Risk'

# 2. Identify the target variable column.
if TARGET_VARIABLE not in df.columns:
    print(f"Error: Target variable column '{TARGET_VARIABLE}' not found in the DataFrame.")
    # Assuming the task cannot proceed without the target variable
    raise ValueError(f"Target variable column '{TARGET_VARIABLE}' not found.")

# 3. Check the distribution of the target variable to confirm class imbalance.
print(f"\nDistribution of the target variable ('{TARGET_VARIABLE}') before handling imbalance:")
display(df[TARGET_VARIABLE].value_counts())
print("\nClass distribution before SMOTE:", Counter(df[TARGET_VARIABLE]))

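# 4. Separate the features (X) and the target variable (y).
# SMOTE interpolates between samples, so it needs numeric features without missing values.
# Here non-numeric columns are dropped and remaining NaNs are median-filled as a simple placeholder;
# in the full pipeline, the encoded and imputed features from the earlier steps would be used instead.
X = df.drop(columns=[TARGET_VARIABLE]).select_dtypes(include='number')
X = X.fillna(X.median())
y = df[TARGET_VARIABLE]
print("\nFeatures (X) and target variable (y) separated.")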

# 5. Implement a strategy to handle class imbalance. Use SMOTE for oversampling.
print("\nApplying SMOTE to handle class imbalance...")
smote = SMOTE(random_state=42)

# 6. Apply the chosen technique to the data.
# Note: Applying SMOTE on the entire dataset is for demonstration only as per instructions.
# In a real scenario, SMOTE should be applied only to the training data *after* splitting.
X_resampled, y_resampled = smote.fit_resample(X, y)
print("SMOTE applied.")

# 7. Print the distribution of the target variable after applying the technique.
print(f"\nDistribution of the target variable ('{TARGET_VARIABLE}') after applying SMOTE:")
display(y_resampled.value_counts())
print("\nClass distribution after SMOTE:", Counter(y_resampled))

Error: 'south_african_credit_risk.csv' not found. Please make sure the file is in the correct directory.


FileNotFoundError: [Errno 2] No such file or directory: 'south_african_credit_risk.csv'
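
The note in the cell above says resampling belongs on the training data only. The sketch below shows that ordering on made-up data, using plain random oversampling with pandas rather than SMOTE: it duplicates minority rows instead of synthesizing new ones, so it is a simpler stand-in and not the same technique. The hand-made split and all values are assumptions.

```python
import pandas as pd

# Made-up data: 8 good loans (0) and 2 bad loans (1)
toy = pd.DataFrame({
    "Income": [10, 12, 9, 15, 11, 14, 13, 16, 5, 6],
    "Credit Risk": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})

# Hypothetical split done by hand: hold out one row of each class before resampling
test = toy.iloc[[0, 8]]
train = toy.drop(index=[0, 8])

counts = train["Credit Risk"].value_counts()
majority_label = counts.idxmax()
majority_size = int(counts.max())

parts = []
for label, group in train.groupby("Credit Risk"):
    if label == majority_label:
        # Majority class is kept as-is
        parts.append(group)
    else:
        # Minority class is sampled with replacement up to the majority size
        parts.append(group.sample(n=majority_size, replace=True, random_state=42))

train_balanced = pd.concat(parts).reset_index(drop=True)
print(train_balanced["Credit Risk"].value_counts().to_dict())
```

The held-out rows keep the original class distribution, so evaluation metrics reflect the imbalance the model will face in practice.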

## Data validation and quality checks

### Subtask:
Implement comprehensive data validation and quality checks throughout the preprocessing pipeline to ensure data integrity and compliance with any relevant regulations.


**Reasoning**:
The primary issue in the previous attempts was the inability to load the dataset. The current subtask requires comprehensive data validation and quality checks, which cannot be done without the data. Therefore, the first step is to attempt to load the dataset again with error handling. If successful, I will proceed with the validation checks as outlined in the instructions.



In [10]:
import pandas as pd
import numpy as np

# 1. Load the credit risk dataset. Include error handling for `FileNotFoundError`.
try:
    df = pd.read_csv('south_african_credit_risk.csv')
    print("Dataset loaded successfully.")
except FileNotFoundError:
    print("Error: 'south_african_credit_risk.csv' not found. Please make sure the file is in the correct directory.")
    # If file is still not found, raise the error to signal failure
    raise

# 2. Perform validation checks on key columns.
print("\n--- Performing Data Validation Checks ---")

# Identify potential key columns based on typical credit risk datasets
# Replace these with actual column names from your dataset if they differ
loan_amount_col = 'Loan Amount'
income_col = 'Income'
age_col = 'Age'
gender_col = 'Gender'
marital_status_col = 'Marital Status'
education_col = 'Education'
credit_score_col = 'Credit Score'
application_date_col = 'Application Date'
customer_id_col = 'Customer ID' # Assuming a unique identifier column
target_col = 'Credit Risk' # Assuming this is the target variable column

# --- Check for expected data types ---
print("\nChecking data types...")
expected_types = {
    loan_amount_col: 'numeric',
    income_col: 'numeric',
    age_col: 'numeric',
    application_date_col: 'text', # Read as text initially; converted to datetime below if present
    customer_id_col: 'numeric', # Assuming customer ID is numerical
    target_col: 'numeric' # Assuming target is numerical (e.g., 0 or 1)
}

# pandas dtype predicates also accept extension dtypes (e.g. Int64, string),
# for which np.issubdtype raises a TypeError.
dtype_checks = {
    'numeric': pd.api.types.is_numeric_dtype,
    'text': lambda s: pd.api.types.is_object_dtype(s) or pd.api.types.is_string_dtype(s),
}

for col, expected_type in expected_types.items():
    if col in df.columns:
        # Check the column against the predicate for its expected kind of dtype
        if not dtype_checks[expected_type](df[col]):
            print(f"  Warning: Column '{col}' has unexpected data type '{df[col].dtype}'. Expected a {expected_type} type.")
        else:
            print(f"  Column '{col}' has expected data type '{df[col].dtype}'.")
    else:
        print(f"  Column '{col}' not found.")

# Attempt to convert date column if it exists
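if application_date_col in df.columns:
    try:
        # Remember which values were already missing so they are not reported as invalid dates
        missing_before = df[application_date_col].isnull()
        df[application_date_col] = pd.to_datetime(df[application_date_col], errors='coerce')
        print(f"  Converted '{application_date_col}' to datetime.")
        # Values that were present but became NaT could not be parsed as dates
        n_invalid = int((df[application_date_col].isnull() & ~missing_before).sum())
        if n_invalid:
            print(f"  Warning: '{application_date_col}' contains {n_invalid} invalid date value(s) (converted to NaT).")
    except Exception as e:
        print(f"  Error converting '{application_date_col}' to datetime: {e}")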


# --- Verifying numerical columns are within a plausible range ---
print("\nChecking numerical ranges...")
numerical_range_checks = {
    income_col: {'min': 0}, # Income cannot be negative
    age_col: {'min': 18, 'max': 100}, # Reasonable age range for borrowers
    loan_amount_col: {'min': 0} # Loan amount cannot be negative
    # Add other numerical columns and their plausible ranges
}

for col, ranges in numerical_range_checks.items():
    if col in df.columns and pd.api.types.is_numeric_dtype(df[col]):
        if 'min' in ranges:
            invalid_min = df[df[col] < ranges['min']]
            if not invalid_min.empty:
                print(f"  Warning: Column '{col}' contains values below the plausible minimum ({ranges['min']}). Examples: {invalid_min[col].head().tolist()}")
            else:
                print(f"  Column '{col}' values are within plausible minimum range.")
        if 'max' in ranges:
            invalid_max = df[df[col] > ranges['max']]
            if not invalid_max.empty:
                print(f"  Warning: Column '{col}' contains values above the plausible maximum ({ranges['max']}). Examples: {invalid_max[col].head().tolist()}")
            else:
                print(f"  Column '{col}' values are within plausible maximum range.")
    elif col in df.columns:
        print(f"  Column '{col}' is not numerical, skipping range check.")
    else:
        print(f"  Column '{col}' not found, skipping range check.")


# --- Checking categorical columns for unexpected values ---
print("\nChecking categorical values...")
categorical_expected_values = {
    gender_col: ['Male', 'Female', np.nan], # Include NaN if missing values are expected
    marital_status_col: ['Single', 'Married', 'Divorced', 'Widowed', np.nan],
    education_col: ['Primary', 'Secondary', 'Tertiary', 'Other', np.nan]
    # Add other categorical columns and their expected values
}

for col, expected_values in categorical_expected_values.items():
    if col in df.columns:
        # Get unique values, including NaN if present
        unique_values = df[col].unique()
        # Check for unexpected values (values not in the expected list, ignoring NaN check if NaN is expected)
        unexpected_values = [val for val in unique_values if pd.notna(val) and val not in expected_values]
        if unexpected_values:
            print(f"  Warning: Column '{col}' contains unexpected values: {unexpected_values}")
        else:
            print(f"  Column '{col}' contains only expected values.")
    else:
        print(f"  Column '{col}' not found, skipping categorical check.")


# --- Ensuring unique identifiers are unique and not null ---
print("\nChecking unique identifiers...")
if customer_id_col in df.columns:
    if df[customer_id_col].isnull().any():
        print(f"  Warning: Column '{customer_id_col}' contains null values.")
    else:
        print(f"  Column '{customer_id_col}' has no null values.")

    if not df[customer_id_col].is_unique:
        print(f"  Warning: Column '{customer_id_col}' is not unique. Number of duplicates: {df[customer_id_col].duplicated().sum()}")
    else:
        print(f"  Column '{customer_id_col}' is unique.")
else:
    print(f"  Unique identifier column '{customer_id_col}' not found, skipping uniqueness check.")


# 3. Implement checks for inconsistencies across columns
print("\nChecking inconsistencies across columns...")

# Example 1: Loan amount is not zero for approved loans.
# Assumption: target_col == 1 marks an approved loan. If 'Credit Risk' encodes default risk instead, adapt or drop this check.
if loan_amount_col in df.columns and target_col in df.columns:
    approved_loans_with_zero_amount = df[(df[target_col] == 1) & (df[loan_amount_col] == 0)]
    if not approved_loans_with_zero_amount.empty:
        print(f"  Warning: Found {len(approved_loans_with_zero_amount)} approved loans with a loan amount of zero.")
    else:
        print("  No approved loans found with a loan amount of zero.")
else:
    print(f"  Cannot check for approved loans with zero amount: Missing '{loan_amount_col}' or '{target_col}'.")


# Example 2: Credit score is within a valid range.
# 300-850 is the FICO scale and is used here only as an example; confirm the scale used by the bureau that supplied the scores.
if credit_score_col in df.columns and pd.api.types.is_numeric_dtype(df[credit_score_col]):
    credit_score_min = 300 # Example minimum
    credit_score_max = 850 # Example maximum
    invalid_credit_scores = df[(df[credit_score_col] < credit_score_min) | (df[credit_score_col] > credit_score_max)]
    if not invalid_credit_scores.empty:
        print(f"  Warning: Found {len(invalid_credit_scores)} records with credit scores outside the valid range ({credit_score_min}-{credit_score_max}). Examples: {invalid_credit_scores[credit_score_col].head().tolist()}")
    else:
        print(f"  All credit scores are within the valid range ({credit_score_min}-{credit_score_max}).")
else:
    print(f"  Cannot check credit score range: Missing '{credit_score_col}' or it's not numerical.")

# Add other cross-column checks as relevant (e.g., Age vs. Employment History Length)


# 4. Summarize the findings from the data validation checks
print("\n--- Summary of Data Validation Findings ---")
print("Review the warnings and messages above to identify potential data quality issues and anomalies.")
print("The checks above cover the following categories of issue:")
print("  - Unexpected data types in key columns.")
print("  - Numerical values outside plausible ranges.")
print("  - Unexpected or inconsistent values in categorical columns.")
print("  - Non-unique or null values in identifier columns.")
print("  - Inconsistencies between related columns (e.g., approved loans with zero amount).")
print("\nFurther investigation and handling (cleaning, transformation, imputation) may be required based on these findings.")

print("\nData validation checks completed.")

Error: 'south_african_credit_risk.csv' not found. Please make sure the file is in the correct directory.


FileNotFoundError: [Errno 2] No such file or directory: 'south_african_credit_risk.csv'
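
The cell above prints its warnings as it goes, so its closing summary cannot say which checks actually failed. A minimal sketch of one alternative is to collect findings in a list and summarise from that. The function name `validate_frame`, its parameters, and the small `sample` frame are hypothetical and exist only to exercise the logic; the column names follow the assumptions made in the cell above, not the real dataset.

```python
import pandas as pd
from pandas.api import types as ptypes


def validate_frame(df, numeric_ranges, allowed_values, id_col):
    """Return a list of (check, column, detail) tuples, one per failed check."""
    findings = []

    # Range checks: each entry maps a column to (min, max); None means unbounded.
    for col, (lo, hi) in numeric_ranges.items():
        if col not in df.columns:
            findings.append(("missing_column", col, "column not found"))
            continue
        if not ptypes.is_numeric_dtype(df[col]):
            findings.append(("dtype", col, f"expected numeric, got {df[col].dtype}"))
            continue
        out_of_range = pd.Series(False, index=df.index)
        if lo is not None:
            out_of_range |= df[col] < lo
        if hi is not None:
            out_of_range |= df[col] > hi
        if out_of_range.any():
            detail = f"{int(out_of_range.sum())} value(s) outside [{lo}, {hi}]"
            findings.append(("range", col, detail))

    # Categorical checks: missing values are ignored, anything else must be allowed.
    for col, allowed in allowed_values.items():
        if col not in df.columns:
            findings.append(("missing_column", col, "column not found"))
            continue
        unexpected = sorted(set(df[col].dropna().unique()) - set(allowed), key=str)
        if unexpected:
            findings.append(("category", col, f"unexpected values: {unexpected}"))

    # Identifier checks: no nulls and no duplicates.
    if id_col not in df.columns:
        findings.append(("missing_column", id_col, "column not found"))
    else:
        if df[id_col].isna().any():
            findings.append(("identifier", id_col, "contains nulls"))
        n_dupes = int(df[id_col].duplicated().sum())
        if n_dupes:
            findings.append(("identifier", id_col, f"{n_dupes} duplicate value(s)"))

    return findings


# Hypothetical sample rows, used only to show the report format.
sample = pd.DataFrame({
    "Customer ID": [1, 2, 2, 4],
    "Age": [25, 17, 40, 130],
    "Income": [5000.0, -10.0, 7200.0, 0.0],
    "Gender": ["Male", "Female", "F", None],
})

report = validate_frame(
    sample,
    numeric_ranges={"Age": (18, 100), "Income": (0, None)},
    allowed_values={"Gender": ["Male", "Female"]},
    id_col="Customer ID",
)
for finding in report:
    print(finding)
```

Because the findings are plain tuples, they could also be loaded into a DataFrame or written to a log file, which may be easier to audit than console output.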

## Documentation and reproducibility

### Subtask:
Document the entire data cleaning and preprocessing pipeline to ensure reproducibility and transparency, which is crucial in regulated environments like credit risk modeling.
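
One way to support this, sketched here under stated assumptions, is to write a small run manifest alongside the pipeline outputs: a hash of the input file, the library versions, the random seed, and the ordered list of steps. The helper names (`file_sha256`, `package_versions`, `build_manifest`), the step names and parameters, and the seed value are all hypothetical placeholders, not taken from the notebook.

```python
import hashlib
import json
import platform
from datetime import datetime, timezone
from importlib import metadata
from pathlib import Path


def file_sha256(path):
    """Hash the raw input file so a later run can confirm it used the same data."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def package_versions(names):
    """Return installed versions for the given packages (None if not installed)."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def build_manifest(data_path, steps, random_seed):
    """Collect the facts needed to reproduce one run of the pipeline."""
    data_path = Path(data_path)
    return {
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "python_version": platform.python_version(),
        "packages": package_versions(["pandas", "numpy", "scikit-learn", "imbalanced-learn"]),
        "data_file": data_path.name,
        # None when the file is absent, as it was in the runs above.
        "data_sha256": file_sha256(data_path) if data_path.is_file() else None,
        "random_seed": random_seed,
        "steps": steps,
    }


# Hypothetical step list and seed; replace with the pipeline's real configuration.
manifest = build_manifest(
    "south_african_credit_risk.csv",
    steps=[
        {"name": "impute_missing", "params": {"strategy": "median"}},
        {"name": "smote", "params": {"random_state": 42}},
    ],
    random_seed=42,
)
print(json.dumps(manifest, indent=2))
```

Saving this JSON next to the cleaned dataset would let a reviewer check which input file and settings produced a given output.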


## Summary:

### Data Analysis Key Findings

*   The dataset `south_african_credit_risk.csv` was not found, preventing the execution of all subsequent data cleaning and preprocessing steps.
*   Attempts to perform initial data exploration, handle missing values, address outliers, engineer features, encode categorical variables, transform and scale data, handle imbalanced data, and perform data validation checks all failed due to the missing data file.

### Insights or Next Steps

*   Place `south_african_credit_risk.csv` in the notebook's working directory, or pass its full path to `pd.read_csv`, so that data loading and all subsequent steps can run.
*   Once the data is accessible, re-run the entire pipeline to complete the planned data cleaning and preprocessing tasks.
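
Before re-running, it may help to confirm where the notebook is actually looking for the file. The sketch below checks a few candidate directories and reports the first match. The helper `find_dataset` and the list of search directories are assumptions for illustration; `/content` is included because it is the default working directory in Colab.

```python
from pathlib import Path


def find_dataset(filename, search_dirs):
    """Return the first existing path to `filename` in `search_dirs`, else None."""
    for directory in search_dirs:
        candidate = Path(directory) / filename
        if candidate.is_file():
            return candidate
    return None


FILENAME = "south_african_credit_risk.csv"
# Hypothetical search locations; adjust to where the file is actually stored.
SEARCH_DIRS = [Path.cwd(), Path.cwd() / "data", Path("/content")]

dataset_path = find_dataset(FILENAME, SEARCH_DIRS)
if dataset_path is None:
    print(f"'{FILENAME}' not found in: {[str(d) for d in SEARCH_DIRS]}")
else:
    print(f"Found dataset at {dataset_path}")
```

If a path is found, it can be passed directly to `pd.read_csv(dataset_path)` in place of the bare filename.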
