# 📊 Loan Approval Dataset - Interactive Analysis

This notebook provides an interactive environment for exploring the loan approval dataset.

**Learning Objectives:**
- Load and explore data using pandas
- Visualize distributions and relationships
- Understand data cleaning techniques
- Prepare data for machine learning

---

## 1️⃣ Setup and Import Libraries

First, let's import all the libraries we'll need for our analysis.

In [None]:
# Data manipulation and analysis
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Statistical analysis
from scipy import stats
from scipy.stats import normaltest, shapiro

# Display settings
import warnings
warnings.filterwarnings('ignore')

# Make plots appear in the notebook
%matplotlib inline

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("✓ All libraries imported successfully!")

## 2️⃣ Load the Dataset

Let's load our loan approval dataset. You can use either:
- `sample_loan_data.csv` (provided sample)
- `loan_approval.csv` (your own dataset)

In [None]:
# Load the dataset
# Update the filename if using a different file
df = pd.read_csv('../data/sample_loan_data.csv')

print("✓ Dataset loaded successfully!")
print(f"  Shape: {df.shape[0]} rows × {df.shape[1]} columns")
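
If you're not sure which of the two files is present, a small helper can pick the first one that exists. This is only a sketch: `first_existing` and `CANDIDATE_FILES` are hypothetical names, and the `../data/` location for `loan_approval.csv` is an assumption based on the sample file's path.

```python
from pathlib import Path

import pandas as pd

# Candidate locations, in order of preference.
# Assumption: loan_approval.csv lives next to the sample file.
CANDIDATE_FILES = [
    Path('../data/sample_loan_data.csv'),
    Path('../data/loan_approval.csv'),
]


def first_existing(paths):
    """Return the first path that exists on disk, or None if none do."""
    for path in paths:
        if Path(path).is_file():
            return Path(path)
    return None


data_path = first_existing(CANDIDATE_FILES)
if data_path is None:
    print("No dataset found; check the paths in CANDIDATE_FILES.")
else:
    df = pd.read_csv(data_path)
    print(f"Loaded {data_path.name}: {df.shape[0]} rows × {df.shape[1]} columns")
```

Checking for the file first gives a clearer message than the `FileNotFoundError` that `pd.read_csv` would raise.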

## 3️⃣ Initial Data Exploration

Let's take a first look at our data.

In [None]:
# Display first few rows
print("First 5 rows of the dataset:")
df.head()

In [None]:
# Get dataset information
print("Dataset Information:")
df.info()

In [None]:
# Statistical summary for numerical columns
print("Statistical Summary (Numerical Columns):")
df.describe()

## 4️⃣ Missing Values Analysis

Identifying and understanding missing values is crucial for data quality.

In [None]:
# Check for missing values
missing_values = df.isnull().sum()
missing_percentage = (missing_values / len(df)) * 100

# Create a DataFrame for better visualization
missing_df = pd.DataFrame({
    'Column': missing_values.index,
    'Missing_Count': missing_values.values,
    'Missing_Percentage': missing_percentage.values
})

# Show only columns with missing values
missing_df = missing_df[missing_df['Missing_Count'] > 0]

if len(missing_df) > 0:
    print("Columns with Missing Values:")
    print(missing_df)
else:
    print("✓ No missing values found!")

In [None]:
# Visualize missing values
if len(missing_df) > 0:
    plt.figure(figsize=(10, 6))
    plt.bar(missing_df['Column'], missing_df['Missing_Percentage'], color='salmon')
    plt.xlabel('Columns')
    plt.ylabel('Missing Percentage (%)')
    plt.title('Missing Values by Column')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()
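
Once you know where the gaps are, one common (but not the only) way to handle them is simple imputation: the median for numerical columns and the most frequent value for categorical ones. The sketch below uses a toy DataFrame with made-up column names; the actual cleaning script may take a different approach.

```python
import numpy as np
import pandas as pd

# Toy data with gaps; column names are hypothetical, not taken from the dataset.
toy = pd.DataFrame({
    'income': [4000.0, np.nan, 5200.0, 6100.0],
    'married': ['Yes', 'No', None, 'Yes'],
})

filled = toy.copy()
for col in filled.columns:
    if pd.api.types.is_numeric_dtype(filled[col]):
        # The median is less sensitive to outliers than the mean
        filled[col] = filled[col].fillna(filled[col].median())
    else:
        # mode() returns the most frequent value(s); take the first one
        filled[col] = filled[col].fillna(filled[col].mode().iloc[0])

print(filled)
```

Working on a copy keeps the original values available, so you can compare distributions before and after filling.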

## 5️⃣ Distribution Analysis

Understanding the distribution of features helps us choose appropriate analysis techniques.

In [None]:
# Select numerical columns
numerical_cols = df.select_dtypes(include=[np.number]).columns.tolist()

print(f"Numerical columns: {numerical_cols}")

In [None]:
# Create histograms for all numerical features
if len(numerical_cols) > 0:
    # squeeze=False keeps axes 2D even when there is only one numerical column
    fig, axes = plt.subplots(len(numerical_cols), 2, figsize=(15, 5 * len(numerical_cols)),
                             squeeze=False)
    
    for idx, col in enumerate(numerical_cols):
        # Remove missing values
        data = df[col].dropna()
        
        # Histogram
        axes[idx, 0].hist(data, bins=30, edgecolor='black', alpha=0.7)
        axes[idx, 0].set_title(f'{col} - Histogram')
        axes[idx, 0].set_xlabel(col)
        axes[idx, 0].set_ylabel('Frequency')
        
        # Box plot
        axes[idx, 1].boxplot(data)  # vertical orientation is the default
        axes[idx, 1].set_title(f'{col} - Box Plot')
        axes[idx, 1].set_ylabel(col)
    
    plt.tight_layout()
    plt.show()

In [None]:
# Normality tests for numerical columns
print("Normality Tests (Shapiro-Wilk):")
print("-" * 60)

for col in numerical_cols:
    data = df[col].dropna()
    if len(data) >= 3:  # Shapiro-Wilk needs at least 3 samples
        stat, p_value = shapiro(data)
        # p > 0.05 means we fail to reject normality; it does not prove it
        is_normal = "✓ Normal" if p_value > 0.05 else "✗ Not Normal"
        print(f"{col:20s} p-value: {p_value:.4f}  →  {is_normal}")
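
Here's one example of how a distribution can guide the next step: right-skewed features are often log-transformed before modeling. The sketch below uses synthetic data (not the loan dataset), and the `income` name is hypothetical. It compares skewness before and after `np.log1p`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic right-skewed values standing in for an income-like feature
income = pd.Series(rng.lognormal(mean=8.0, sigma=0.8, size=500), name='income')

skew_before = income.skew()

# log1p computes log(1 + x), which also handles zeros safely
log_income = np.log1p(income)
skew_after = log_income.skew()

print(f"Skewness before log1p: {skew_before:.2f}")
print(f"Skewness after log1p:  {skew_after:.2f}")
```

A skewness near 0 suggests a roughly symmetric distribution. Whether a transform actually helps depends on the model you plan to use.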

## 6️⃣ Categorical Variables Analysis

In [None]:
# Select categorical columns
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()

print(f"Categorical columns: {categorical_cols}")

In [None]:
# Create bar charts for categorical variables
if len(categorical_cols) > 0:
    n_rows = (len(categorical_cols) + 1) // 2
    # squeeze=False always returns a 2D array, so flatten() is safe for any column count
    fig, axes = plt.subplots(n_rows, 2, figsize=(15, 4 * n_rows), squeeze=False)
    axes = axes.flatten()
    
    for idx, col in enumerate(categorical_cols):
        value_counts = df[col].value_counts()
        axes[idx].bar(value_counts.index, value_counts.values, color='steelblue')
        axes[idx].set_title(f'{col} - Distribution')
        axes[idx].set_xlabel(col)
        axes[idx].set_ylabel('Count')
        axes[idx].tick_params(axis='x', rotation=45)
        
        # Add count labels on bars
        for i, v in enumerate(value_counts.values):
            axes[idx].text(i, v, str(v), ha='center', va='bottom')
    
    # Hide extra subplots
    for idx in range(len(categorical_cols), len(axes)):
        axes[idx].set_visible(False)
    
    plt.tight_layout()
    plt.show()

## 7️⃣ Correlation Analysis

Understanding relationships between variables.

In [None]:
# Calculate correlation matrix for numerical features
if len(numerical_cols) > 1:
    correlation_matrix = df[numerical_cols].corr()
    
    # Create heatmap
    plt.figure(figsize=(10, 8))
    sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0,
                square=True, linewidths=1, fmt='.2f')
    plt.title('Correlation Matrix - Numerical Features')
    plt.tight_layout()
    plt.show()
    
    print("\nStrong correlations (|r| > 0.5):")
    for i in range(len(correlation_matrix.columns)):
        for j in range(i+1, len(correlation_matrix.columns)):
            if abs(correlation_matrix.iloc[i, j]) > 0.5:
                print(f"  {correlation_matrix.columns[i]} ↔ {correlation_matrix.columns[j]}: "
                      f"{correlation_matrix.iloc[i, j]:.3f}")
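
`DataFrame.corr()` computes Pearson correlation by default, which measures linear association. For a relationship that is monotonic but curved, Spearman's rank correlation can be more informative. Here's a toy illustration with made-up data, where `y` grows exponentially with `x`:

```python
import numpy as np
import pandas as pd

# Made-up data: y increases with x, but not along a straight line
x = np.linspace(0, 5, 50)
toy = pd.DataFrame({'x': x, 'y': np.exp(x)})

pearson = toy.corr(method='pearson').loc['x', 'y']
spearman = toy.corr(method='spearman').loc['x', 'y']

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Spearman works on ranks, so it reports a perfect monotonic relationship here, while Pearson comes out noticeably lower because the points don't fall on a line.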

## 8️⃣ Target Variable Analysis

If your dataset has a target variable (e.g., `Loan_Status`), analyze it here.

In [None]:
# Check if Loan_Status exists
if 'Loan_Status' in df.columns:
    target_counts = df['Loan_Status'].value_counts()
    target_pct = df['Loan_Status'].value_counts(normalize=True) * 100
    
    print("Loan Status Distribution:")
    print(f"  Approved (Y): {target_counts.get('Y', 0)} ({target_pct.get('Y', 0):.1f}%)")
    print(f"  Rejected (N): {target_counts.get('N', 0)} ({target_pct.get('N', 0):.1f}%)")
    
    # Visualize
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))
    
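    # Map colors to labels so approvals stay green regardless of sort order
    # (value_counts() orders by frequency, not by label)
    color_map = {'Y': 'green', 'N': 'red'}
    colors = [color_map.get(label, 'gray') for label in target_counts.index]
    
    # Bar chart
    axes[0].bar(target_counts.index, target_counts.values, color=colors)
    axes[0].set_title('Loan Status - Bar Chart')
    axes[0].set_xlabel('Status')
    axes[0].set_ylabel('Count')
    
    # Pie chart
    axes[1].pie(target_counts.values, labels=target_counts.index, autopct='%1.1f%%',
                colors=colors, startangle=90)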
    axes[1].set_title('Loan Status - Pie Chart')
    
    plt.tight_layout()
    plt.show()
else:
    print("Loan_Status column not found in dataset.")
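
The class balance also gives you a baseline for later modeling: a model that always predicts the most common class reaches an accuracy equal to that class's share. Here's a minimal sketch with hypothetical labels (not the real dataset):

```python
import pandas as pd

# Hypothetical labels: 7 approvals and 3 rejections
status = pd.Series(['Y', 'Y', 'N', 'Y', 'N', 'Y', 'Y', 'N', 'Y', 'Y'],
                   name='Loan_Status')

shares = status.value_counts(normalize=True)
majority_class = shares.idxmax()   # most frequent label
baseline_accuracy = shares.max()   # share of that label

print(f"Majority class: {majority_class}")
print(f"Baseline accuracy: {baseline_accuracy:.1%}")
```

Any model you build later should beat this number to be worth using.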

## 9️⃣ Your Turn - Exploratory Questions

Try answering these questions using code:

In [None]:
# Question 1: What is the average income of applicants?
# Your code here:


In [None]:
# Question 2: What percentage of applicants are married?
# Your code here:


In [None]:
# Question 3: Is there a relationship between education and loan approval?
# Hint: Use pd.crosstab() or groupby()
# Your code here:
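
To see what the `pd.crosstab()` hint produces before using it on the real data, here's a sketch on a toy DataFrame. The `Education` column name and its values are assumptions; check `df.columns` for the names in your dataset.

```python
import pandas as pd

# Toy data; column names and values are hypothetical
toy = pd.DataFrame({
    'Education': ['Graduate', 'Graduate', 'Not Graduate',
                  'Graduate', 'Not Graduate', 'Not Graduate'],
    'Loan_Status': ['Y', 'Y', 'N', 'N', 'Y', 'N'],
})

# normalize='index' turns each row into shares that sum to 1
approval_by_education = pd.crosstab(
    toy['Education'], toy['Loan_Status'], normalize='index'
) * 100

print(approval_by_education.round(1))
```

Each row shows the percentage of approved and rejected loans within one education group, which makes groups of different sizes comparable.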


## 🎯 Summary

In this notebook, you learned how to:
- ✓ Load and explore datasets
- ✓ Identify and visualize missing values
- ✓ Analyze distributions (numerical and categorical)
- ✓ Test for normality
- ✓ Explore correlations
- ✓ Analyze target variables

**Next Steps:**
1. Run the cleaning script (`3_data_cleaning.py`) to prepare the data
2. Experiment with feature engineering
3. Build predictive models
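
As a starting point for the feature engineering step, categorical columns usually need to be turned into numbers before most models can use them. One common option is one-hot encoding with `pd.get_dummies()`. The sketch below uses toy data with hypothetical column names:

```python
import pandas as pd

# Toy data; 'income' and 'area' are made-up column names
toy = pd.DataFrame({
    'income': [4000, 5200, 6100],
    'area': ['Urban', 'Rural', 'Urban'],
})

# Each category in 'area' becomes its own 0/1 column
encoded = pd.get_dummies(toy, columns=['area'], dtype=int)

print(encoded)
```

Numerical columns pass through unchanged, and the original `area` column is replaced by one indicator column per category.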

---
**Happy Learning! 📚**