# Lab 2: Bank Customer Churn Prediction
## Data Loading and Initial Exploration

**Student:** Hadi Al Moairk  
**Course:** ARTI308 - Machine Learning  
**University:** IAU (Imam Abdulrahman Bin Faisal University)  
**Date:** February 2026

---

### Problem Statement
This notebook demonstrates the initial data loading and exploration for a **Binary Classification** problem. The goal is to predict whether a bank customer will churn (leave the bank) based on their profile and banking behavior.

## Step 1: Import Required Libraries

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

# For better visualizations
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

print("✅ Libraries imported successfully!")

## Step 2: Load the Dataset

In [None]:
# Load the dataset using pandas
df = pd.read_csv('Churn_Modelling.csv')

print("✅ Dataset loaded successfully!")
print(f"\nDataset loaded from: Churn_Modelling.csv")
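
If the CSV is not in the notebook's working directory, `pd.read_csv` stops with a traceback that can be hard to read. The sketch below is one possible way to fail early with a clearer message. The helper name `load_churn_csv` is hypothetical and not part of the lab; only `pd.read_csv` and `pathlib.Path` are real library calls.

```python
from pathlib import Path

import pandas as pd


def load_churn_csv(path):
    """Read a CSV file, raising a descriptive error if the file is absent.

    `load_churn_csv` is an illustrative helper, not part of the original lab.
    """
    csv_path = Path(path)
    if not csv_path.is_file():
        # Show the absolute path so a wrong working directory is easy to spot
        raise FileNotFoundError(f"Expected dataset at {csv_path.resolve()}")
    return pd.read_csv(csv_path)
```

Under these assumptions, the cell above could call `df = load_churn_csv('Churn_Modelling.csv')` instead of `pd.read_csv` directly.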

## Step 3: Display Dataset Shape

In [None]:
# Display the shape of the dataset
print(f"Dataset Shape: {df.shape}")
print(f"\nNumber of Rows (Samples): {df.shape[0]:,}")
print(f"Number of Columns (features + target): {df.shape[1]}")

## Step 4: Preview First Rows

In [None]:
# Display the first 5 rows
print("First 5 rows of the dataset:\n")
df.head()

## Step 5: Display Column Names and Data Types

In [None]:
# Display column names
print("Column Names:")
print(df.columns.tolist())

print("\n" + "="*60)

# Display data types
print("\nData Types and Non-Null Counts:")
df.info()

## Step 6: Statistical Summary

In [None]:
# Display statistical summary
print("Statistical Summary of Numerical Features:\n")
df.describe()

## Step 7: Check for Missing Values

In [None]:
# Check for missing values
print("Missing Values per Column:\n")
missing_values = df.isnull().sum()
print(missing_values)

total_missing = missing_values.sum()
print(f"\nTotal Missing Values: {total_missing}")

if total_missing == 0:
    print("\n✅ Great! No missing values found in the dataset.")
else:
    print(f"\n⚠️ Warning: Dataset contains {total_missing} missing values.")
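
When a dataset does contain gaps, the raw counts are easier to interpret next to percentages. The following is a minimal sketch, run on a small made-up frame rather than the real file; the helper name `missing_summary` and the toy values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd


def missing_summary(frame):
    """Return missing counts and percentages for columns that have any gaps."""
    counts = frame.isnull().sum()
    percent = counts / len(frame) * 100
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    # Keep only affected columns, with the worst column first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy data (invented values) with deliberate gaps
toy = pd.DataFrame({
    "Age": [42, np.nan, 39, 41],
    "Balance": [0.0, 83807.86, np.nan, np.nan],
    "Exited": [1, 0, 1, 0],
})
report = missing_summary(toy)
print(report)
```

On a frame with no missing values, such as the one this notebook reports, the returned table would simply be empty.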

## Step 8: Target Variable Distribution

In [None]:
# Check target variable distribution
print("Target Variable (Exited) Distribution:\n")
target_counts = df['Exited'].value_counts()
print(target_counts)

print("\nPercentage Distribution:")
target_percentage = df['Exited'].value_counts(normalize=True) * 100
print(target_percentage)

# Visualize target distribution
plt.figure(figsize=(8, 5))
df['Exited'].value_counts().plot(kind='bar', color=['#2ecc71', '#e74c3c'])
plt.title('Customer Churn Distribution', fontsize=14, fontweight='bold')
plt.xlabel('Exited (0 = Stayed, 1 = Churned)', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.xticks(rotation=0)
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()
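
One way to read the class distribution is as a baseline: a model that always predicts the majority class reaches an accuracy equal to that class's share, so later models should be judged against that number. The sketch below illustrates the idea on a made-up target column; `majority_baseline` is a hypothetical helper, not something defined in the lab.

```python
import pandas as pd


def majority_baseline(target):
    """Return the most frequent label and the accuracy of always predicting it."""
    shares = target.value_counts(normalize=True)
    return shares.idxmax(), float(shares.max())


# Toy target (invented): three customers stayed, one churned
toy_target = pd.Series([0, 0, 0, 1], name="Exited")
label, accuracy = majority_baseline(toy_target)
print(f"Always predicting {label} gives accuracy {accuracy:.2f}")
```

In the notebook, the same function could be applied to `df['Exited']`.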

## Step 9: Basic Data Insights

In [None]:
# Display basic insights
print("=" * 70)
print("DATASET SUMMARY")
print("=" * 70)

print(f"\n📊 Total Records: {len(df):,}")
print(f"📋 Total Columns (features + target): {df.shape[1]}")
print(f"🎯 Target Variable: Exited")
print(f"✅ Customers who stayed: {(df['Exited'] == 0).sum():,} ({(df['Exited'] == 0).sum() / len(df) * 100:.2f}%)")
print(f"❌ Customers who churned: {(df['Exited'] == 1).sum():,} ({(df['Exited'] == 1).sum() / len(df) * 100:.2f}%)")

# Count numerical and categorical features
numerical_features = df.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = df.select_dtypes(include=['object']).columns.tolist()

print(f"\n🔢 Numerical Features: {len(numerical_features)}")
print(f"📝 Categorical Features: {len(categorical_features)}")

print("\n" + "=" * 70)
print("✅ Data loading and initial exploration completed successfully!")
print("=" * 70)
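
The counts in this cell include the target column `Exited` among the numerical columns, because it is stored as an integer. A possible refinement, sketched here on a toy frame with invented values, is to drop the target before splitting columns by type. The helper name `split_feature_types` is an assumption for illustration.

```python
import pandas as pd


def split_feature_types(frame, target="Exited"):
    """Split input columns into numerical and non-numerical, excluding the target."""
    features = frame.drop(columns=[target])
    numerical = features.select_dtypes(include="number").columns.tolist()
    categorical = features.select_dtypes(exclude="number").columns.tolist()
    return numerical, categorical


# Toy frame (invented values) using column names mentioned in this notebook
toy = pd.DataFrame({
    "Geography": ["France", "Spain"],
    "Gender": ["Female", "Male"],
    "Age": [42, 41],
    "Balance": [0.0, 83807.86],
    "Exited": [1, 0],
})
numerical, categorical = split_feature_types(toy)
print("Numerical:", numerical)
print("Categorical:", categorical)
```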

## 🎓 Key Findings

1. **Dataset Size:** 10,000 customer records with 14 columns (the input features plus the target `Exited`)
2. **No Missing Values:** The dataset is complete with no null values
3. **Binary Classification Problem:** Predicting customer churn (Exited: 0 or 1)
4. **Class Distribution:** Step 8 reports how many customers stayed (`Exited = 0`) vs. churned (`Exited = 1`), as counts and percentages
5. **Mixed Data Types:** Contains both numerical (age, balance, salary) and categorical (geography, gender) features

---

## 🔄 Next Steps (Future Labs)

- **Data Preprocessing:** Handle categorical variables, feature scaling
- **Exploratory Data Analysis (EDA):** Visualize relationships between features
- **Feature Engineering:** Create new features, remove irrelevant ones
- **Model Training:** Train various classification algorithms
- **Model Evaluation:** Compare models using accuracy, precision, recall, F1-score
- **Model Optimization:** Hyperparameter tuning and cross-validation
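
As a preview of the preprocessing step listed above, the sketch below shows one common way to encode categorical columns and scale a numerical one using pandas alone. It runs on a toy frame with invented values and is an illustration under those assumptions, not the method the later labs will necessarily use.

```python
import pandas as pd

# Toy frame (invented values) using column names mentioned in this notebook
toy = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany", "France"],
    "Gender": ["Female", "Male", "Male", "Female"],
    "Age": [42, 41, 39, 43],
    "Exited": [1, 0, 1, 0],
})

# Separate inputs from the target
X = toy.drop(columns=["Exited"])
y = toy["Exited"]

# One-hot encode the categorical columns; drop_first avoids a redundant column
X_encoded = pd.get_dummies(
    X, columns=["Geography", "Gender"], drop_first=True, dtype=int
)

# Standardize Age to zero mean and unit (sample) standard deviation
X_encoded["Age"] = (X_encoded["Age"] - X_encoded["Age"].mean()) / X_encoded["Age"].std()

print(X_encoded)
```

In a real pipeline the scaling statistics would be computed on the training split only and then reused on the test split, to avoid leaking information.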