# Assignment: Banking Transactions

"""
Scenario:
A financial technology company has provided you with raw datasets containing banking transactions and customer details. Your task is to clean and transform this data to make it suitable for financial analysis.

## Dataset Description:

### 1. `customers_raw.csv`
This dataset contains customer details:
- `customer_id`: Unique identifier for each customer.
- `name`: Full name of the customer.
- `age`: Age of the customer.
- `gender`: Gender of the customer (Male/Female/Other).
- `email`: Email address of the customer.
- `city`: City where the customer resides.

### 2. `bank_transactions.csv`
This dataset contains details of financial transactions:
- `transaction_id`: Unique identifier for each transaction.
- `customer_id`: Identifier linking transactions to customers in `customers_raw.csv`.
- `transaction_date`: Date of the transaction.
- `transaction_type`: Type of transaction (Deposit, Withdrawal, Payment, Transfer).
- `amount`: Transaction amount.
- `balance_after_transaction`: Account balance after the transaction.
"""

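import pandas as pd
import numpy as np
from faker import Faker

# Seed both random sources so the generated CSV files are reproducible.
# The seed value 42 is arbitrary.
np.random.seed(42)
Faker.seed(42)

# Initialize Faker
fake = Faker()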

# Generate Customer Data (100 customers).
# np.random.randint excludes its upper bound, so ages run from 18 to 79.
df_customers = pd.DataFrame({
    'customer_id': range(1, 101),
    'name': [fake.name() for _ in range(100)],
    'age': np.random.randint(18, 80, 100),
    'gender': np.random.choice(['Male', 'Female', 'Other'], 100),
    'email': [fake.email() for _ in range(100)],
    'city': [fake.city() for _ in range(100)]
})

df_customers.to_csv('customers_raw.csv', index=False)

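# Generate Transaction Data (500 transactions).
# Every customer_id is drawn from df_customers, and balance_after_transaction
# is sampled independently of amount, so it is not a true running balance.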
df_transactions = pd.DataFrame({
    'transaction_id': range(1, 501),
    'customer_id': np.random.choice(df_customers['customer_id'], 500),
    'transaction_date': [fake.date_this_decade() for _ in range(500)],
    'transaction_type': np.random.choice(['Deposit', 'Withdrawal', 'Payment', 'Transfer'], 500),
    'amount': np.round(np.random.uniform(100, 5000, 500), 2),
    'balance_after_transaction': np.round(np.random.uniform(1000, 20000, 500), 2)
})

df_transactions.to_csv('bank_transactions.csv', index=False)

print("Synthetic datasets generated: 'customers_raw.csv' and 'bank_transactions.csv'")

# Assignment Questions:

# 1. Load the dataset and display basic information.
#    - Read `customers_raw.csv` and `bank_transactions.csv` into Pandas DataFrames.
#    - Display the first 5 rows and basic statistics.
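#
#    A minimal sketch of this step. To stay runnable without the file on disk,
#    it reads a small in-memory stand-in (hypothetical rows) instead of
#    `customers_raw.csv`; in the assignment, pass the file name to pd.read_csv.

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for customers_raw.csv (same column names).
csv_text = io.StringIO(
    "customer_id,name,age,gender,email,city\n"
    "1,Ann Lee,34,Female,ann@example.com,Austin\n"
    "2,Bob Ray,51,Male,bob@example.org,Boston\n"
)
customers = pd.read_csv(csv_text)

print(customers.head())      # first 5 rows (fewer if the frame is shorter)
customers.info()             # column dtypes and non-null counts
print(customers.describe())  # summary statistics for the numeric columns
```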

# 2. Merge datasets
#    - Merge customer data with transaction records using an inner join.
#    - Display the first 5 rows of the merged DataFrame.

# 3. Perform different types of joins
#    - Merge using left, right, and outer joins.
#    - Compare and explain the results.
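#
#    A small sketch of how the four join types differ, using hypothetical toy
#    frames. In the generated data every transaction points at an existing
#    customer, so the joins only diverge for customers without transactions;
#    the toy data below adds an unmatched row on each side to make the
#    difference visible.

```python
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "city": ["Austin", "Boston", "Chicago"],
})
transactions = pd.DataFrame({
    "transaction_id": [10, 11, 12],
    "customer_id": [1, 1, 4],  # customer 4 has no customer record
    "amount": [250.0, 75.5, 900.0],
})

# Count the rows each join type produces on the shared key.
row_counts = {}
for how in ["inner", "left", "right", "outer"]:
    merged = customers.merge(transactions, on="customer_id", how=how)
    row_counts[how] = len(merged)

print(row_counts)
```

#    Passing `indicator=True` to `merge` adds a `_merge` column that records
#    which side each row came from, which helps when explaining the results.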

# 4. Handle missing values
#    - Identify missing values and summarize their count per column.
#    - Drop or fill missing values appropriately.
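#
#    A minimal sketch of counting and handling missing values. The generator
#    above writes no missing values, so this uses a hypothetical toy frame with
#    a few gaps. The fill choices (median, "Unknown", drop) are assumptions,
#    not the required answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [34, np.nan, 51, 28],
    "city": ["Austin", None, "Boston", "Austin"],
    "amount": [250.0, 75.5, np.nan, 900.0],
})

# Number of missing entries in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].median())  # numeric: median
filled["city"] = filled["city"].fillna("Unknown")             # text: placeholder
filled = filled.dropna(subset=["amount"])                     # drop rows lacking an amount
```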

# 5. Remove duplicate records
#    - Check for duplicate entries and remove them.
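#
#    A minimal sketch of detecting and dropping exact duplicate rows, on a
#    hypothetical toy frame (the generated files contain unique IDs).

```python
import pandas as pd

df = pd.DataFrame({
    "transaction_id": [1, 2, 2, 3],
    "amount": [100.0, 250.0, 250.0, 75.0],
})

# duplicated() marks every repeat of an earlier, fully identical row.
n_duplicates = int(df.duplicated().sum())

# Keep the first occurrence of each row and rebuild a clean 0..n-1 index.
deduped = df.drop_duplicates().reset_index(drop=True)
```

#    To treat rows as duplicates based on selected columns only, pass
#    `subset=[...]` to `duplicated` and `drop_duplicates`.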

# 6. Rename columns and indexes
#    - Modify column names for better readability.
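#
#    A short sketch of renaming columns and the index. The new names are
#    hypothetical; any readable naming scheme works.

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2],
    "balance_after_transaction": [1500.0, 320.0],
})

# Rename columns through a mapping of old name -> new name.
renamed = df.rename(columns={
    "customer_id": "Customer ID",
    "balance_after_transaction": "Balance After",
})

# Use the ID as the index, then give the index itself a new name.
renamed = renamed.set_index("Customer ID").rename_axis("Customer")
```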

# 7. Convert categorical data into dummy variables
#    - Apply one-hot encoding to categorical features.
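#
#    A minimal sketch of one-hot encoding with `pd.get_dummies`, on a
#    hypothetical toy frame. The `type` prefix is an arbitrary choice.

```python
import pandas as pd

df = pd.DataFrame({
    "transaction_type": ["Deposit", "Payment", "Deposit"],
    "amount": [100.0, 40.0, 60.0],
})

# Replace the categorical column with one 0/1 column per category.
encoded = pd.get_dummies(df, columns=["transaction_type"], prefix="type", dtype=int)
print(encoded)
```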

# 8. String manipulations
#    - Standardize customer names (convert to lowercase, strip leading/trailing and repeated spaces).
#    - Extract domain names from email addresses.
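#
#    A minimal sketch of both string tasks using the `.str` accessor on
#    hypothetical values. It reads "remove spaces" as trimming and collapsing
#    whitespace rather than deleting the space between first and last name;
#    that reading is an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["  Ann  Lee ", "BOB RAY"],
    "email": ["ann@example.com", "bob@mail.example.org"],
})

# Trim the ends, lowercase, then collapse internal runs of whitespace.
df["name_clean"] = (
    df["name"].str.strip().str.lower().str.replace(r"\s+", " ", regex=True)
)

# Everything after the last "@" is the domain.
df["email_domain"] = df["email"].str.split("@").str[-1]
```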

# 9. Discretization and binning
#    - Categorize transaction amounts into bins: Low, Medium, High.
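#
#    A minimal sketch of binning with `pd.cut`. The cut points are hypothetical;
#    they were picked because the generated amounts fall between 100 and 5000.

```python
import pandas as pd

amounts = pd.Series([120.0, 980.0, 2500.0, 4999.0])

# Hypothetical bin edges; intervals are right-closed: (0, 1000], (1000, 3000], (3000, 5000].
bins = [0, 1000, 3000, 5000]
labels = ["Low", "Medium", "High"]
amount_band = pd.cut(amounts, bins=bins, labels=labels)
print(amount_band)
```

#    `pd.qcut(amounts, q=3, labels=labels)` is an alternative that splits on
#    quantiles, giving bins with roughly equal counts instead of fixed edges.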

# 10. Identify and handle outliers
#    - Use statistical methods (IQR, Z-score) to detect outliers.
#    - Decide how to handle them (e.g., remove or cap) and justify the choice.
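#
#    A minimal sketch of both detection rules on a hypothetical series, followed
#    by capping as one possible way to handle the flagged values. A Z-score
#    cutoff of 3 is common on larger data; with only seven points no value can
#    exceed roughly 2.27, so this toy example uses 2.

```python
import pandas as pd

amounts = pd.Series([100.0, 120.0, 130.0, 110.0, 125.0, 115.0, 5000.0])

# IQR rule: flag values more than 1.5 * IQR outside the quartiles.
q1, q3 = amounts.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = amounts[(amounts < lower) | (amounts > upper)]

# Z-score rule: flag values far from the mean in standard-deviation units.
z = (amounts - amounts.mean()) / amounts.std()
z_outliers = amounts[z.abs() > 2]

# One handling option (an assumption, not the only choice): cap at the IQR fences.
capped = amounts.clip(lower=lower, upper=upper)
```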

# 11. Perform random sampling and permutation
#    - Take a random sample of customers.
#    - Shuffle the dataset and compare before and after.
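#
#    A minimal sketch of sampling and shuffling with `DataFrame.sample`, on a
#    hypothetical ten-customer frame. The `random_state` value is arbitrary and
#    only makes the result repeatable.

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": range(1, 11)})

# Random sample of 3 customers, drawn without replacement.
sample = customers.sample(n=3, random_state=0)

# Sampling 100% of the rows returns them in a random order (a permutation).
shuffled = customers.sample(frac=1, random_state=0).reset_index(drop=True)

print(customers["customer_id"].tolist())
print(shuffled["customer_id"].tolist())
```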

# 12. Grouping and Aggregation
#    - Group transactions by customer ID and compute the total transaction amount per customer.
#    - Identify the top 5 customers by total transaction amount.
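#
#    A minimal sketch of grouping and ranking on hypothetical toy data. It sums
#    every transaction type together; whether deposits count as "spending" is
#    left to the analyst. The toy frame has three customers, so it shows the
#    top 2 rather than the top 5.

```python
import pandas as pd

tx = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "amount": [100.0, 50.0, 300.0, 20.0, 30.0, 40.0],
})

# Total amount per customer.
totals = tx.groupby("customer_id")["amount"].sum()

# Largest totals first; use nlargest(5) on the full dataset.
top = totals.nlargest(2)
print(top)
```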

# 13. Apply group-wise transformations
#    - Normalize transaction amounts within each transaction type category.
#    - Compute the percentage contribution of each transaction to total customer spending.
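#
#    A minimal sketch of group-wise transformations with `groupby().transform`,
#    on hypothetical toy data. "Normalize" is read here as min-max scaling to
#    [0, 1]; that is an assumption, and a z-score per group is an equally valid
#    reading. Note that min-max scaling divides by zero when all amounts in a
#    group are equal.

```python
import pandas as pd

tx = pd.DataFrame({
    "customer_id": [1, 1, 2, 2],
    "transaction_type": ["Deposit", "Payment", "Deposit", "Payment"],
    "amount": [100.0, 300.0, 200.0, 100.0],
})

# transform() returns one value per original row, aligned with tx.
by_type = tx.groupby("transaction_type")["amount"]
group_min = by_type.transform("min")
group_max = by_type.transform("max")
tx["amount_scaled"] = (tx["amount"] - group_min) / (group_max - group_min)

# Share of each transaction in its customer's total, in percent.
customer_total = tx.groupby("customer_id")["amount"].transform("sum")
tx["pct_of_customer_total"] = 100 * tx["amount"] / customer_total
```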

# 14. Correlation Analysis
#    - Compute the correlation between transaction amount and customer demographics (age directly; encode categorical fields such as gender first).
#    - Analyze trends in spending across different age groups.
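#
#    A minimal sketch of the correlation and age-group steps. The values are
#    hypothetical and deliberately trend upward; in the generated data, age and
#    amount are drawn independently, so a correlation near zero is expected
#    there. The age-group edges are also an assumption.

```python
import pandas as pd

merged = pd.DataFrame({
    "age": [22, 35, 47, 58, 63, 71],
    "amount": [120.0, 300.0, 280.0, 450.0, 500.0, 610.0],
})

# Pearson correlation between age and transaction amount.
corr = merged["age"].corr(merged["amount"])

# Hypothetical age bands: (17, 30], (30, 50], (50, 80].
merged["age_group"] = pd.cut(
    merged["age"], bins=[17, 30, 50, 80], labels=["18-30", "31-50", "51+"]
)
mean_by_group = merged.groupby("age_group", observed=True)["amount"].mean()
print(round(corr, 3))
print(mean_by_group)
```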

# 15. Data Insights & Visualization
#    - Create histograms for transaction amounts.
#    - Generate a scatter plot to visualize transaction amount vs. age.
#    - Summarize findings and business recommendations.

"""
## **Evaluation Rubric**

### **1. Data Loading & Initial Exploration (10 points)**
- Correctly loads the dataset (5 points)
- Displays relevant basic information (5 points)

### **2. Data Merging (15 points)**
- Performs correct inner join (5 points)
- Executes left, right, and outer joins correctly (5 points)
- Explains differences between join types (5 points)

### **3. Handling Missing Values (15 points)**
- Identifies and summarizes missing values (5 points)
- Correctly applies different missing value handling techniques (10 points)

### **4. Data Cleaning & Transformation (20 points)**
- Removes duplicates effectively (5 points)
- Properly renames columns and indexes (5 points)
- Converts categorical data using one-hot encoding (5 points)
- Applies correct string manipulations (5 points)

### **5. Advanced Data Processing (20 points)**
- Performs discretization and binning appropriately (5 points)
- Identifies and handles outliers effectively (5 points)
- Executes correct random sampling and permutation (5 points)
- Implements proper group-wise transformations (5 points)

### **6. Correlation & Insights (10 points)**
- Computes correlation between transaction amount and demographics (5 points)
- Draws meaningful conclusions from correlation analysis (5 points)

### **7. Visualization & Reporting (10 points)**
- Creates relevant visualizations (5 points)
- Provides a well-structured summary and business recommendations (5 points)

### **Total: 100 Points**
"""
