#### Week 4 Project
 AI-Powered Data Analysis & Automation
#### Objective:
 This project applies AI-driven techniques to data cleaning,
 visualization, predictive analytics, and automation. Participants will
 use tools such as Power BI, Google AutoML, and Python to
 analyze data, generate insights, and support business
 decision-making.
Tools Needed
 Google AutoML – For AI-powered data preprocessing and
 predictive modeling
 Power BI – For AI-driven data visualization and automated
 insights
 Python (Pandas, Scikit-learn, Matplotlib, Seaborn) – For
 advanced data analysis and modeling


#### Task 1: AI-Powered Data Cleaning and Preprocessing
 This task involves cleaning and preparing the dataset using AI tools to
 handle missing values, detect outliers, and ensure data consistency.
 Steps to Follow:
 Step 1: Upload the Dataset
 1) Download the dataset from the provided source.
 2) Open Google AutoML or Power BI.
 3) Upload the dataset into the tool:
 Google AutoML: Click on "New Dataset," select the file, and
 upload it.
 Power BI: Open Power BI, go to "Home" → "Get Data" →
 "Excel/CSV" → Select the dataset and click "Load."


#### Load the Dataset

In [7]:
import pandas as pd

# Step 1: Load the dataset from a CSV file
file_path = "raw_dataset_week4.csv"  # Replace with actual file path if needed
df = pd.read_csv(file_path)

# Display the first few rows to confirm successful loading
print("Dataset loaded successfully. Here are the first 5 rows:")
print(df.head())

# Display basic info to check for missing values and data types
print("\nDataset Info:")
print(df.info())

Dataset loaded successfully. Here are the first 5 rows:
   Customer_ID  Age  Gender    Income  Spending_Score  Credit_Score  \
0            1   56  Female  142418.0               7         391.0   
1            2   69    Male   63088.0              82         652.0   
2            3   46    Male  136868.0              91         662.0   
3            4   32  Female       NaN              34         644.0   
4            5   60    Male   59811.0              91         469.0   

   Loan_Amount  Previous_Defaults  Marketing_Spend  Purchase_Frequency  \
0       8083.0                  1            15376                   3   
1      34328.0                  2             6889                   6   
2      47891.0                  2             6054                  29   
3      25103.0                  2             4868                   8   
4      44891.0                  1            17585                  12   

  Seasonality  Sales  Customer_Churn  Defaulted  
0         Low  32526  

#### Step 2: Handle Missing Values 

In [17]:
import pandas as pd

# Step 2: Handle missing values
# Reload the dataset to ensure we start with the original data
file_path = "raw_dataset_week4.csv"  # Replace with your actual file path
df = pd.read_csv(file_path)

# Check initial missing values (should show 50 for Income, Credit_Score, Loan_Amount)
print("Missing values before cleaning:")
print(df.isnull().sum())

# Fill missing values with column means for numerical columns with NaN
columns_with_missing = ["Income", "Credit_Score", "Loan_Amount"]
for col in columns_with_missing:
    mean_value = df[col].mean()
    df[col] = df[col].fillna(mean_value)  # Direct assignment to avoid inplace warning
    print(f"Filled '{col}' missing values with mean: {mean_value:.2f}")

# Verify no missing values remain
print("\nMissing values after cleaning:")
print(df.isnull().sum())

# Display the first few rows to confirm changes
print("\nFirst 5 rows after handling missing values:")
print(df.head())

# Optionally save the updated dataset (uncomment to save)
# df.to_csv("cleaned_data_step2.csv", index=False)
# print("\nDataset saved as 'cleaned_data_step2.csv'")

Missing values before cleaning:
Customer_ID            0
Age                    0
Gender                 0
Income                50
Spending_Score         0
Credit_Score          50
Loan_Amount           50
Previous_Defaults      0
Marketing_Spend        0
Purchase_Frequency     0
Seasonality            0
Sales                  0
Customer_Churn         0
Defaulted              0
dtype: int64
Filled 'Income' missing values with mean: 84398.06
Filled 'Credit_Score' missing values with mean: 573.41
Filled 'Loan_Amount' missing values with mean: 28456.93

Missing values after cleaning:
Customer_ID           0
Age                   0
Gender                0
Income                0
Spending_Score        0
Credit_Score          0
Loan_Amount           0
Previous_Defaults     0
Marketing_Spend       0
Purchase_Frequency    0
Seasonality           0
Sales                 0
Customer_Churn        0
Defaulted             0
dtype: int64

First 5 rows after handling missing values:
   Customer_ID  A

Observations for Step 2: Handle Missing Values
Dataset State Before Cleaning:
Missing Values Identified: 
Income: 50 missing values.

Credit_Score: 50 missing values.

Loan_Amount: 50 missing values.

All other columns (Customer_ID, Age, Gender, Spending_Score, Previous_Defaults, Marketing_Spend, Purchase_Frequency, Seasonality, Sales, Customer_Churn, Defaulted): 0 missing values.

Total Missing Values: 150 (50 per affected column), representing 10% of the 500 entries in each of the three columns.

Implication: The dataset was incomplete for key financial metrics (Income, Credit_Score, Loan_Amount), which could impact analysis or modeling if not addressed.

Cleaning Process:
Method Applied: Missing values in Income, Credit_Score, and Loan_Amount were filled with their respective column means, calculated from the 450 non-null entries in each column.

Mean Values Used:
Income: 84,398.06 (average income across non-missing entries).

Credit_Score: 573.41 (average credit score).

Loan_Amount: 28,456.93 (average loan amount).

Execution: The code reloaded the original dataset and applied the mean imputation successfully, ensuring the "before cleaning" state accurately reflects the initial 50 missing values per column.

Dataset State After Cleaning:
Missing Values Eliminated: 
All columns now show 0 missing values, confirming that the imputation filled all 150 gaps.

First 5 Rows:
Customer 4’s Income, originally missing, is now 84398.055556 (the mean), while Credit_Score (644.0) and Loan_Amount (25103.0) were already present.

Other rows (e.g., Customers 1, 2, 3, 5) retain their original values, as they had no missing data in these columns.

Data Integrity: The structure remains intact with 500 rows and 14 columns, and no unintended changes occurred to non-missing values.

Key Observations:
Effectiveness: The mean imputation successfully resolved all missing values, making the dataset complete for further analysis.

Mean Values:
Income mean of 84,398.06 suggests a moderate average income, with some variability (original values range from ~20,000 to ~149,000).

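Credit_Score mean of 573.41 falls inside the typical credit score range (300–850) but toward its lower end. Under commonly used FICO-style bands, scores below roughly 580 are usually labeled poor, so the average customer here is not strongly creditworthy.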

Loan_Amount mean of 28,456.93 reflects a moderate average loan size, with observed values ranging from ~5,000 to ~49,000.

Impact on Data: Filling with means preserves each column's mean but reduces its variance, because 50 values per column now sit exactly at the center. This can mask extreme cases and weaken correlations with other columns. It is the usual trade-off of mean imputation versus other methods (e.g., median or interpolation).

Consistency: The output aligns with expectations—Customer 4’s missing Income is now filled, and the rest of the data remains unchanged, validating the process.
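
The variance trade-off noted above can be checked on a small example. The sketch below uses a hypothetical ten-value column (not the project dataset) and compares the spread before and after filling. The `median` variant is shown only as the alternative the observations mention, not as something the notebook ran.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column (not the project dataset): 8 known incomes, 2 missing
income = pd.Series(
    [20000, 35000, 50000, 62000, 75000, 90000, 120000, 149000, np.nan, np.nan],
    name="Income",
)

mean_filled = income.fillna(income.mean())      # same approach as Step 2
median_filled = income.fillna(income.median())  # alternative, more robust to skew

# pandas skips NaN by default, so the first std uses only the 8 known values
print(f"Std of known values:   {income.std():.2f}")
print(f"Std after mean fill:   {mean_filled.std():.2f}")
print(f"Std after median fill: {median_filled.std():.2f}")
print("Mean preserved by mean fill:", np.isclose(income.mean(), mean_filled.mean()))
```

The mean stays the same after a mean fill, while the standard deviation drops because the filled rows add no spread.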

#### Step 3: Detect and Handle Outliers


In [23]:
import pandas as pd
import numpy as np

# Step 3: Detect and handle outliers using IQR method
# Select numerical columns for outlier detection
numerical_columns = ["Income", "Spending_Score", "Credit_Score", "Loan_Amount", 
                     "Previous_Defaults", "Marketing_Spend", "Purchase_Frequency", "Sales"]

# Calculate Q1, Q3, and IQR for each numerical column
Q1 = df[numerical_columns].quantile(0.25)
Q3 = df[numerical_columns].quantile(0.75)
IQR = Q3 - Q1

# Define lower and upper bounds for outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Print bounds for reference
print("Lower bounds for outliers:")
print(lower_bound)
print("\nUpper bounds for outliers:")
print(upper_bound)

# Remove rows with values outside the bounds (outliers)
df_cleaned = df[~((df[numerical_columns] < lower_bound) | (df[numerical_columns] > upper_bound)).any(axis=1)]

# Display results
print("\nOriginal dataset shape:", df.shape)
print("Cleaned dataset shape after removing outliers:", df_cleaned.shape)
print("\nNumber of rows removed:", df.shape[0] - df_cleaned.shape[0])

# Display first 5 rows of cleaned dataset
print("\nFirst 5 rows of cleaned dataset:")
print(df_cleaned.head())

Lower bounds for outliers:
Income               -46872.375
Spending_Score          -51.125
Credit_Score            109.125
Loan_Amount           -8426.000
Previous_Defaults        -3.000
Marketing_Spend       -7545.875
Purchase_Frequency      -14.500
Sales                -41516.875
dtype: float64

Upper bounds for outliers:
Income                216110.625
Spending_Score           153.875
Credit_Score            1026.125
Loan_Amount            64850.000
Previous_Defaults          5.000
Marketing_Spend        28687.125
Purchase_Frequency        45.500
Sales                 150548.125
dtype: float64

Original dataset shape: (500, 14)
Cleaned dataset shape after removing outliers: (500, 14)

Number of rows removed: 0

First 5 rows of cleaned dataset:
   Customer_ID  Age  Gender         Income  Spending_Score  Credit_Score  \
0            1   56  Female  142418.000000               7         391.0   
1            2   69    Male   63088.000000              82         652.0   
2            3

#### Process Overview
Method: The Interquartile Range (IQR) method was used to detect outliers in eight numerical columns: Income, Spending_Score, Credit_Score, Loan_Amount, Previous_Defaults, Marketing_Spend, Purchase_Frequency, and Sales.

Steps: 
Calculated Q1 (25th percentile) and Q3 (75th percentile) for each column.

Computed IQR = Q3 - Q1.

Defined outlier bounds as:
Lower bound = Q1 - 1.5 * IQR.

Upper bound = Q3 + 1.5 * IQR.

Removed rows where any value fell outside these bounds.
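
The steps above can be sketched on a single column. The helper name `iqr_bounds` and the toy values are assumptions made for illustration. They are not part of the notebook, which applies the same arithmetic to eight columns at once.

```python
import pandas as pd

def iqr_bounds(series: pd.Series, k: float = 1.5):
    """Return (lower, upper) fences at Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical toy values with one obvious extreme (not from the project dataset)
values = pd.Series([10, 12, 13, 15, 16, 18, 20, 95])
lower, upper = iqr_bounds(values)

# Keep only the values that fall outside the fences
outliers = values[(values < lower) | (values > upper)]

print(f"Fences: [{lower:.3f}, {upper:.3f}]")
print("Flagged values:", outliers.tolist())
```

Unlike the project data, this toy column contains a value far from the rest, so the fences do flag it.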

#### Outlier Bounds
Lower Bounds:
Income: -46,872.375

Spending_Score: -51.125

Credit_Score: 109.125

Loan_Amount: -8,426.000

Previous_Defaults: -3.000

Marketing_Spend: -7,545.875

Purchase_Frequency: -14.500

Sales: -41,516.875

Upper Bounds:
Income: 216,110.625

Spending_Score: 153.875

Credit_Score: 1,026.125

Loan_Amount: 64,850.000

Previous_Defaults: 5.000

Marketing_Spend: 28,687.125

Purchase_Frequency: 45.500

Sales: 150,548.125

#### Results
Original Shape: 500 rows, 14 columns.

Cleaned Shape: 500 rows, 14 columns.

Rows Removed: 0 (no change in dataset size).

First 5 Rows: Identical to the post-Step 2 dataset, with no alterations.

#### Key Observations
No Outliers Detected:
Despite applying the IQR method, no rows were flagged as outliers, meaning all 500 rows had values within the calculated bounds for all eight numerical columns.

This suggests the values are spread fairly evenly across their ranges with no heavy tails, so the IQR bounds are wide enough to encompass every observed value.

Bounds Analysis:
Income (-46,872.375 to 216,110.625): 
Observed range (~20,000 to ~149,000) fits well within this wide range. Even extremes like 149,922 (Customer 196) are below 216,110.625.

Spending_Score (-51.125 to 153.875): 
Range is 1 to 99 (e.g., Customer 37 has 1, Customer 147 has 99). Since the lower bound is negative (impossible) and the upper bound exceeds 100, no outliers are possible here.

Credit_Score (109.125 to 1,026.125): 
Range (~300 to ~848) is fully contained. Even 848 (Customer 194) is below 1,026.125.

Loan_Amount (-8,426.000 to 64,850.000): 
Range (~5,000 to ~49,000) fits comfortably. No negative loans exist, and 49,936 (Customer 116) is below 64,850.

Previous_Defaults (-3.000 to 5.000): 
Values are 0, 1, or 2—well within bounds (negative is impossible).

Marketing_Spend (-7,545.875 to 28,687.125): 
Range (~1,234 to ~19,990) fits, with 19,990 (Customer 71) below 28,687.125.

Purchase_Frequency (-14.500 to 45.500): 
Range (1 to 29) is contained, with no negative values possible.

Sales (-41,516.875 to 150,548.125): 
Range (~5,000 to ~99,835) fits, with 99,835 (Customer 169) below 150,548.125.

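Why Were No Rows Removed?
The IQR bounds are broad because the middle 50% of each column already covers a wide span. For Income, the reported bounds imply Q1 ≈ 51,746 and Q3 ≈ 117,492, so IQR ≈ 65,746, and extending each quartile by 1.5 × IQR (≈ 98,619) pushes the bounds well past the observed minimum and maximum.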

All observed values fall within these generous ranges, indicating either:
The data is well-behaved with no extreme outliers.

The 1.5 * IQR threshold is too lenient for this dataset’s distribution.
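
The Income quartiles can be recovered from the fences reported in the Step 3 output, since the distance between the two fences is always 4 × IQR. The second half of the sketch is an illustration under an assumption the notebook does not state: that a column is spread evenly over its range. For such a column the 1.5 × IQR fences always land outside the data, which would explain why nothing was flagged.

```python
import numpy as np

# Income fences as reported in the Step 3 output
lower, upper = -46872.375, 216110.625

# upper - lower = (Q3 - Q1) + 2 * 1.5 * IQR = 4 * IQR
iqr = (upper - lower) / 4
q1 = lower + 1.5 * iqr
q3 = upper - 1.5 * iqr
print(f"Implied Income Q1={q1:,.2f}, Q3={q3:,.2f}, IQR={iqr:,.2f}")

# Illustration only (assumption: an evenly spread column, not the real data)
uniform = np.linspace(0.0, 100.0, 1001)
u_q1, u_q3 = np.quantile(uniform, [0.25, 0.75])
u_iqr = u_q3 - u_q1
u_lower, u_upper = u_q1 - 1.5 * u_iqr, u_q3 + 1.5 * u_iqr
flagged = int(np.sum((uniform < u_lower) | (uniform > u_upper)))
print(f"Uniform fences: [{u_lower:.1f}, {u_upper:.1f}], values flagged: {flagged}")
```

For an evenly spread column the fences sit half a range-width beyond each end, so a stricter multiplier or a domain-based rule would be needed to flag anything.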



#### Step 4: Save the Cleaned Data

In [52]:
# Step 4: Save the cleaned dataset
output_file = "cleaned_data.csv"  # .csv extension so later cells can load it with pd.read_csv("cleaned_data.csv")
df.to_csv(output_file, index=False)

# Confirm the save
print(f"Cleaned dataset saved successfully as '{output_file}'")
print("Final dataset shape:", df.shape)
print("\nFirst 5 rows of the saved dataset:")
print(df.head())

Cleaned dataset saved successfully as 'cleaned_data.csv'
Final dataset shape: (500, 14)

First 5 rows of the saved dataset:
   Customer_ID  Age  Gender         Income  Spending_Score  Credit_Score  \
0            1   56  Female  142418.000000               7         391.0   
1            2   69    Male   63088.000000              82         652.0   
2            3   46    Male  136868.000000              91         662.0   
3            4   32  Female   84398.055556              34         644.0   
4            5   60    Male   59811.000000              91         469.0   

   Loan_Amount  Previous_Defaults  Marketing_Spend  Purchase_Frequency  \
0       8083.0                  1            15376                   3   
1      34328.0                  2             6889                   6   
2      47891.0                  2             6054                  29   
3      25103.0                  2             4868                   8   
4      44891.0                  1            17585   

#### Final Check After Cleaning the Dataset

In [46]:
import pandas as pd

# Load the cleaned dataset saved in Step 4
file_path = "cleaned_data.csv"  # Adjust to your actual file path
df = pd.read_csv(file_path)

# Check for missing/null values
print("Missing/Null Values Check:")
print(df.isnull().sum())

# Check data types
print("\nData Types:")
print(df.dtypes)

# Basic consistency checks
print("\nBasic Consistency Checks:")
# Check for negative values where they shouldn’t exist
numerical_cols = ["Age", "Income", "Spending_Score", "Credit_Score", "Loan_Amount", 
                  "Previous_Defaults", "Marketing_Spend", "Purchase_Frequency", "Sales"]
for col in numerical_cols:
    negatives = df[df[col] < 0].shape[0]
    print(f"Number of negative values in '{col}': {negatives}")

# Check categorical columns for unexpected values
print("\nGender Unique Values:", df["Gender"].unique())
print("Seasonality Unique Values:", df["Seasonality"].unique())
print("Customer_Churn Unique Values:", df["Customer_Churn"].unique())
print("Defaulted Unique Values:", df["Defaulted"].unique())

# Summary statistics for numerical columns
print("\nSummary Statistics:")
print(df[numerical_cols].describe())

Missing/Null Values Check:
Customer_ID           0
Age                   0
Gender                0
Income                0
Spending_Score        0
Credit_Score          0
Loan_Amount           0
Previous_Defaults     0
Marketing_Spend       0
Purchase_Frequency    0
Seasonality           0
Sales                 0
Customer_Churn        0
Defaulted             0
dtype: int64

Data Types:
Customer_ID             int64
Age                     int64
Gender                 object
Income                float64
Spending_Score          int64
Credit_Score          float64
Loan_Amount           float64
Previous_Defaults       int64
Marketing_Spend         int64
Purchase_Frequency      int64
Seasonality            object
Sales                   int64
Customer_Churn          int64
Defaulted               int64
dtype: object

Basic Consistency Checks:
Number of negative values in 'Age': 0
Number of negative values in 'Income': 0
Number of negative values in 'Spending_Score': 0
Number of negative val

#### Results Summary
Missing/Null Values  
Finding: No missing or null values detected across all 14 columns (500 rows each).

Details: Customer_ID, Age, Gender, Income, Spending_Score, Credit_Score, Loan_Amount, Previous_Defaults, Marketing_Spend, Purchase_Frequency, Seasonality, Sales, Customer_Churn, and Defaulted all show 0 nulls.

Conclusion: Step 2’s mean imputation successfully filled all 150 original missing values.

Data Types  
Finding: Data types are consistent and appropriate.

Details: 
Integers (int64): Customer_ID, Age, Spending_Score, Previous_Defaults, Marketing_Spend, Purchase_Frequency, Sales, Customer_Churn, Defaulted.

Floats (float64): Income, Credit_Score, Loan_Amount (due to mean imputation).

Objects (object): Gender, Seasonality (categorical).

Conclusion: Types align with data content; float precision from imputation is expected.

Consistency Checks  
Negative Values: No negative values found in Age, Income, Spending_Score, Credit_Score, Loan_Amount, Previous_Defaults, Marketing_Spend, Purchase_Frequency, or Sales.

Categorical Values: 
Gender: Only "Female" and "Male".

Seasonality: Only "Low", "Medium", "High".

Customer_Churn and Defaulted: Only 0 and 1.

Conclusion: No illogical or unexpected values detected; data adheres to expected ranges and categories.
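
The checks summarized above can also be collected into one reusable function, so that a later run fails loudly instead of relying on someone reading the printed output. This is a minimal sketch: the `validate` helper, its parameters, and the three-row sample are hypothetical and are not part of the notebook.

```python
import pandas as pd

def validate(df: pd.DataFrame, non_negative: list, allowed: dict) -> list:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if df.isnull().any().any():
        problems.append("missing values present")
    for col in non_negative:
        if (df[col] < 0).any():
            problems.append(f"negative values in {col}")
    for col, values in allowed.items():
        unexpected = set(df[col].unique()) - set(values)
        if unexpected:
            problems.append(f"unexpected values in {col}: {sorted(unexpected)}")
    return problems

# Hypothetical three-row sample that reuses the project's column names
sample = pd.DataFrame({
    "Income": [142418.0, 63088.0, 136868.0],
    "Gender": ["Female", "Male", "Male"],
    "Seasonality": ["Low", "Medium", "High"],
    "Defaulted": [0, 1, 0],
})
rules = {
    "Gender": ["Female", "Male"],
    "Seasonality": ["Low", "Medium", "High"],
    "Defaulted": [0, 1],
}
print(validate(sample, non_negative=["Income"], allowed=rules))
```

Calling `validate` on the full cleaned DataFrame with the nine numerical columns and four categorical columns would reproduce the checks from the cell above in a single call.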

#### Summary Statistics  
Key Metrics:
Age: Mean 44.22, Min 18, Max 69.

Income: Mean 84,398.06, Min 20,055, Max 149,922, Std 38,049.40 (imputation centralized some values).

Credit_Score: Mean 573.41, Min 300, Max 848, Std 149.30.

Loan_Amount: Mean 28,456.93, Min 5,163, Max 49,936, Std 11,788.25.

Sales: Mean 54,378.95, Min 5,203, Max 99,835, Std 27,263.11.

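Observation: Ranges are reasonable. The statistics reflect the imputation: for example, the Income median equals the mean because the 50 imputed values, all set to the mean, sit in the middle of the sorted column. No outliers were removed in Step 3, so the extremes persist (e.g., Income 149,922).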



#### Task 3: AI-Driven Predictive and Prescriptive Analytics

#### Step 2: Evaluate Model Performance



In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
import numpy as np
import xgboost as xgb

# Load the cleaned dataset
df = pd.read_csv("cleaned_data.csv")

# Clean Seasonality
df['Seasonality'] = df['Seasonality'].str.strip().str.capitalize()
print("Cleaned unique values in 'Seasonality':", df['Seasonality'].unique())
print("Value counts:\n", df['Seasonality'].value_counts())

# Encode Seasonality, dropping 'Low' as baseline
df_encoded = pd.get_dummies(df, columns=['Seasonality'])
df_encoded = df_encoded.drop('Seasonality_Low', axis=1)

# Cap extreme Sales values
sales_cap = df_encoded['Sales'].quantile(0.95)
df_encoded['Sales'] = df_encoded['Sales'].clip(upper=sales_cap)

# Prepare features (X) and target (y)
numeric_features = ['Age', 'Income', 'Spending_Score', 'Credit_Score', 'Loan_Amount', 
                    'Previous_Defaults', 'Marketing_Spend', 'Purchase_Frequency']
seasonality_cols = ['Seasonality_High', 'Seasonality_Medium']
X = df_encoded[numeric_features + seasonality_cols].copy()

# Add interaction features using .loc
X.loc[:, 'Spend_x_Freq'] = X['Marketing_Spend'] * X['Purchase_Frequency']
X.loc[:, 'Score_x_Income'] = X['Spending_Score'] * X['Income']

# Add polynomial features (degree 2) for key predictors
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X[['Marketing_Spend', 'Purchase_Frequency', 'Spending_Score', 'Income']])
poly_cols = poly.get_feature_names_out(['Marketing_Spend', 'Purchase_Frequency', 'Spending_Score', 'Income'])
X_poly_df = pd.DataFrame(X_poly, columns=poly_cols, index=X.index)

# Combine original and polynomial features
X = pd.concat([X, X_poly_df], axis=1)

# Debug: Check features
print("Features (X) columns:", X.columns.tolist())
print("Sample X data:\n", X.head())

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

y = df_encoded['Sales']

# Split data: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# XGBoost with broader tuning
param_grid_xgb = {
    'n_estimators': [200, 300, 400],
    'max_depth': [5, 7, 9],
    'learning_rate': [0.05, 0.1, 0.15]
}
xgb_model = xgb.XGBRegressor(objective='reg:squarederror', random_state=42)
grid_search_xgb = GridSearchCV(xgb_model, param_grid_xgb, cv=5, scoring='r2', n_jobs=-1)
grid_search_xgb.fit(X_train, y_train)

# Best XGBoost model
best_model = grid_search_xgb.best_estimator_
print("Best XGBoost parameters:", grid_search_xgb.best_params_)

# Predictions
predictions = best_model.predict(X_test)

# Evaluate performance
mse = mean_squared_error(y_test, predictions)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, predictions)

# Print results
print("\nXGBoost Performance Evaluation:")
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")
print(f"R-squared (R²): {r2:.2f}")

# Feature importance
feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': best_model.feature_importances_
})
print("\nXGBoost Feature Importance:")
print(feature_importance.sort_values(by='Importance', ascending=False))

# Baseline check
mean_sales = y_test.mean()
baseline_mse = mean_squared_error(y_test, [mean_sales] * len(y_test))
print("\nBaseline (Mean) MSE:", baseline_mse)

Cleaned unique values in 'Seasonality': ['Low' 'Medium' 'High']
Value counts:
 Seasonality
Medium    169
High      168
Low       163
Name: count, dtype: int64
Features (X) columns: ['Age', 'Income', 'Spending_Score', 'Credit_Score', 'Loan_Amount', 'Previous_Defaults', 'Marketing_Spend', 'Purchase_Frequency', 'Seasonality_High', 'Seasonality_Medium', 'Spend_x_Freq', 'Score_x_Income', 'Marketing_Spend', 'Purchase_Frequency', 'Spending_Score', 'Income', 'Marketing_Spend^2', 'Marketing_Spend Purchase_Frequency', 'Marketing_Spend Spending_Score', 'Marketing_Spend Income', 'Purchase_Frequency^2', 'Purchase_Frequency Spending_Score', 'Purchase_Frequency Income', 'Spending_Score^2', 'Spending_Score Income', 'Income^2']
Sample X data:
    Age         Income  Spending_Score  Credit_Score  Loan_Amount  \
0   56  142418.000000               7         391.0       8083.0   
1   69   63088.000000              82         652.0      34328.0   
2   46  136868.000000              91         662.0      47

Seasonality Cleaned Values:
Cleaned unique values in 'Seasonality': ['Low', 'Medium', 'High']

Value counts for 'Seasonality':

Medium: 169

High: 168

Low: 163

Feature Engineering:
The dataset now includes several new interaction and polynomial features to capture complex relationships:

Interaction features like Spend_x_Freq, Score_x_Income.

Polynomial features for Marketing_Spend, Purchase_Frequency, Spending_Score, and Income.

Model Performance:
Best XGBoost Parameters:

Learning Rate: 0.05

Max Depth: 7

Number of Estimators: 200

Performance Metrics:

Mean Squared Error (MSE): 899,290,526.93

Root Mean Squared Error (RMSE): 29,988.17

R-squared (R²): -0.11

Baseline (Mean) MSE: 813,101,997.68

Feature Importance:
The most important features contributing to the model include:

Marketing_Spend Spending_Score

Spend_x_Freq

Purchase_Frequency Spending_Score

Purchase_Frequency Income

Previous_Defaults

Purchase_Frequency

Credit_Score

Key Insights:
Negative R-squared (R²):

The negative R² value (-0.11) indicates that the model is currently performing worse than a horizontal line representing the mean of the target variable (sales).

This suggests that the current set of features and their transformations are not effectively capturing the relationships needed to predict sales.
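
The link between the reported R² and the two MSE values can be verified by hand. The sketch below only reuses the numbers printed above. It relies on the fact that the baseline in the notebook predicts the mean of the test set, which is the same reference that `r2_score` uses.

```python
import math

# Values copied from the reported output above
model_mse = 899_290_526.93
baseline_mse = 813_101_997.68  # MSE of always predicting the test-set mean

# R² = 1 - SS_res / SS_tot; dividing both sums by n turns this into a ratio of MSEs
r2 = 1 - model_mse / baseline_mse
rmse = math.sqrt(model_mse)

print(f"R² from the two MSE values: {r2:.2f}")
print(f"RMSE: {rmse:,.2f}")
```

A model MSE about 10.6% above the baseline MSE is exactly what an R² of -0.11 expresses.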

Model Accuracy:

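The model's MSE (899,290,526.93) is higher than the mean-baseline MSE (813,101,997.68), which is what the negative R² expresses. An RMSE of 29,988.17 is also large relative to Sales values that range from about 5,203 to 99,835, so the model needs further refinement before its predictions are useful.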

Feature Impact:

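The feature importance analysis ranks interaction terms such as Marketing_Spend Spending_Score and Spend_x_Freq highest. Because the model does not beat the mean baseline, these rankings should be read with caution: they show where the trees split, not that the features actually predict Sales.

The printed feature list also contains duplicates. Marketing_Spend, Purchase_Frequency, Spending_Score, and Income each appear twice (once as original columns and once from PolynomialFeatures), and the manual interactions Spend_x_Freq and Score_x_Income repeat the polynomial terms Marketing_Spend Purchase_Frequency and Spending_Score Income. Removing these duplicates and other less relevant features is a reasonable next step in feature engineering.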
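
One further refinement worth considering: the cell above fits `StandardScaler` on all rows before the train/test split, so test-set statistics leak into preprocessing. Scaling has little effect on tree models, but keeping preprocessing inside a pipeline is a safer habit. The sketch below shows the pattern under clear assumptions: the data is synthetic (`make_regression`), and `Ridge` is a stand-in so the example needs only scikit-learn. It is not the notebook's XGBoost model.

```python
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the project dataset)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Scaling lives inside the pipeline, so each CV fold fits the scaler
# on its own training portion only
model = Pipeline([
    ("scale", StandardScaler()),
    ("regressor", Ridge(alpha=1.0)),  # stand-in for the XGBoost regressor
])

model_scores = cross_val_score(model, X, y, cv=5, scoring="r2")
baseline_scores = cross_val_score(
    DummyRegressor(strategy="mean"), X, y, cv=5, scoring="r2"
)

print(f"Pipeline mean R²: {model_scores.mean():.3f}")
print(f"Mean-baseline R²: {baseline_scores.mean():.3f}")
```

Reporting the mean-predictor score next to the model score makes it easy to see at a glance whether the model adds anything over the baseline.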

#### Task 4: AI for Business Strategy and Risk Management

In [9]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Selecting features and target variable
X = df[['Income', 'Loan_Amount', 'Credit_Score']]
y = df['Defaulted']

# Splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Training the model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Making predictions
predictions = model.predict(X_test)

# Model accuracy
accuracy = accuracy_score(y_test, predictions)
print(f'Model Accuracy: {accuracy:.2f}')


Model Accuracy: 0.82


~ END ~