- **IMPORTING MODULES**

In [53]:
import pandas as pd
import numpy as np

- **TASKS**

- **Part A: Conceptual Foundation**

1. Write short notes on:

• What is Data Analysis?

- Data Analysis means collecting, cleaning, and studying data to find useful information. It helps us understand patterns, trends, and relationships in data so that better decisions can be made.

• How to Plan a Data Science Project.

- A typical Data Science Project Plan looks like this:

- Define the Problem – Understand what you’re trying to solve.

- Collect Data – Gather all relevant data.

- Explore and Clean Data – Handle missing values, outliers, and inconsistent formats.

- Feature Engineering – Create or modify features that improve model performance.

- Model Building – Choose and train machine learning models.

- Model Evaluation – Check how well the model performs.

- Deployment – Use the model in a real system or application.

- Monitoring & Maintenance – Track performance and update the model when needed.

• How to Frame a Machine Learning Problem.

- Steps to Frame a ML Problem:

- Understand the Goal – What do we want to predict or classify?

- Identify Inputs (Features) – What information do we have to make predictions?

- Identify Output (Target) – What outcome do we want to predict?

- Select ML Type:
    - upervised Learning – When we have both input and output data (e.g., predict loan default).
    - Unsupervised Learning – When we only have inputs (e.g., group customers by spending habits).

- Define Evaluation Metrics – How will we measure success? (accuracy, RMSE, F1 score, etc.)

- Consider Constraints – Data availability, time, computing power, etc.

2. Explain Tensors and provide an in-depth explanation with NumPy examples.

- A Tensor is a container for numerical data — like scalars, vectors, and matrices — that can have more than two dimensions.

- They are used a lot in machine learning and deep learning (like in TensorFlow or PyTorch).

- Types of Tensors:
    - Scalar (0D Tensor): A single number.
    - Vector (1D Tensor): A list of numbers.
    - Matrix (2D Tensor): A table of numbers (rows and columns).
    - 3D or Higher Tensors: Collections of matrices (used in images, videos, etc.).

- Example:

import numpy as np

tensor_3d = np.array([

[[1, 2], [3, 4]],

[[5, 6], [7, 8]]

])

print("3D Tensor:\n", tensor_3d)

3D Tensor:

[[[1 2]

[3 4]]

[[5 6]

[7 8]]]

- **Part B: Data Acquisition**

3. Import datasets from multiple sources:

- Load CSV files (main transactions dataset).

- Parse JSON files (customer metadata).

- Fetch records from SQL (loan repayment history).

- Fetch data from a dummy API (external economic indicators).

In [54]:
# Load CSV file
data_csv = pd.read_csv("credit_dataset.csv")

In [55]:
# Load JSON file
data_json = pd.read_json("customer_metadata.json")

In [56]:
# Fetch data from a dummy API(external economic indicators)

- **Part C: Data Understanding & Cleaning**


4. Explore the dataset using Pandas (info(), describe()).

In [57]:
print(data_csv.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   customer_id        1000 non-null   int64  
 1   age                920 non-null    float64
 2   gender             950 non-null    object 
 3   region             1000 non-null   object 
 4   education_level    1000 non-null   object 
 5   employment_type    930 non-null    object 
 6   annual_income      940 non-null    float64
 7   loan_amount        960 non-null    float64
 8   loan_purpose       1000 non-null   object 
 9   credit_score       950 non-null    float64
 10  repayment_history  1000 non-null   int64  
 11  transaction_count  1000 non-null   int64  
 12  spending_ratio     1000 non-null   float64
 13  join_date          1000 non-null   object 
 14  default_flag       1000 non-null   int64  
dtypes: float64(5), int64(4), object(6)
memory usage: 117.3+ KB
None


In [58]:
print(data_csv.describe())

         customer_id         age  annual_income   loan_amount  credit_score  \
count    1000.000000  920.000000     940.000000  9.600000e+02    950.000000   
mean   100499.500000   35.290217   62178.613044  2.633497e+04    647.743158   
std       288.819436    9.496320   66307.077136  6.885842e+04     70.360530   
min    100000.000000   18.000000    4373.610000  5.133500e+02    250.000000   
25%    100249.750000   28.000000   24076.677500  7.484122e+03    607.000000   
50%    100499.500000   35.000000   42389.430000  1.469949e+04    647.000000   
75%    100749.250000   41.000000   77030.370000  2.805263e+04    690.000000   
max    100999.000000   74.000000  728180.593989  1.808443e+06    950.000000   

       repayment_history  transaction_count  spending_ratio  default_flag  
count          1000.0000        1000.000000     1000.000000    1000.00000  
mean              0.5990          60.094000       31.823288       0.12100  
std               1.0016          51.167874       20.210652 

5. Perform Pandas Profiling to generate a data quality report.

In [59]:
# Performing pandas profiling

6. Handle missing data with:
    - Simple Imputer (numerical: mean/median).
    - Simple Imputer (categorical: most frequent).
    - Most Frequent Category Imputation.
    - Missing Indicator + Random Sample Imputation.
    - KNN Imputer (multivariate).
    - MICE Algorithm.
    - Complete Case Analysis (dropping rows/columns).

In [60]:
# Simple Imputer for numerical data
from sklearn.impute import SimpleImputer
num_imputer = SimpleImputer(strategy='mean')
data_csv[['age', 'income']] = num_imputer.fit_transform(data_csv[['age', 'annual_income']])

In [61]:
# Simple Imputer for categorical data
from sklearn.impute import SimpleImputer
cat_imputer = SimpleImputer(strategy='most_frequent')
data_csv[['gender']] = cat_imputer.fit_transform(data_csv[['gender']])

In [62]:
# Most frequent category imputation (impute categorical columns by column names)
# 'employment_type' has missing values in this dataset — impute it using the imputer
cat_imputer = SimpleImputer(strategy="most_frequent")
data_csv[['employment_type']] = cat_imputer.fit_transform(data_csv[['employment_type']])

In [63]:
# Missing indicator + Random sample imputation
from sklearn.impute import MissingIndicator
missing_indicator = MissingIndicator()
data_csv[['credit_score']] = missing_indicator.fit_transform(data_csv[['credit_score']])
random_sample = data_csv['credit_score'].dropna().sample(data_csv['credit_score'].isnull().sum(), random_state=0)
data_csv.loc[data_csv['credit_score'].isnull(), 'credit_score'] = random_sample.values

In [64]:
# KNN Imputer
from sklearn.impute import KNNImputer

In [65]:
# MICE Algorithm
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

In [66]:
# Complete case analysis (dropping(rows/columns)
# Dropping rows with missing values
data_csv_dropped_rows = data_csv.dropna()

# Dropping columns with missing values
data_csv_dropped_columns = data_csv.dropna(axis=1)

- **Part D: Outlier Handling**

- 7. Detect and treat outliers using:
    - Z-score Method.
    - IQR Method.
    - Percentile Method.
    - Winsorization Technique.

In [67]:
# Detecting outliers using Z-score method
from scipy import stats
threshold = 3
z_scores = np.abs(stats.zscore(data_csv.select_dtypes(include=[np.number])))
outliers = np.where(z_scores > threshold)
data_csv_no_outliers = data_csv[(z_scores < threshold).all(axis=1)]


In [68]:
# Detecting outliers using IQR method

for cols in data_csv.select_dtypes(include=[np.number]).columns:
    Q1 = data_csv[cols].quantile(0.25)
    Q3 = data_csv[cols].quantile(0.75)
    IQR = Q3 - Q1
    lower_limit = Q1 - 1.5 * IQR
    upper_limit = Q3 + 1.5 * IQR
    data_csv = data_csv[(data_csv[cols] >= lower_limit) & (data_csv[cols] <= upper_limit)]

In [69]:
# Detecting outliers using percentile method
lower_percentile = 0.01
upper_percentile = 0.99
for cols in data_csv.select_dtypes(include=[np.number]).columns:
    lower_limit = data_csv[cols].quantile(lower_percentile)
    upper_limit = data_csv[cols].quantile(upper_percentile)
    data_csv = data_csv[(data_csv[cols] >= lower_limit) & (data_csv[cols] <= upper_limit)]

In [70]:
# Detecting outliers using winsorization method
from scipy.stats.mstats import winsorize
for cols in data_csv.select_dtypes(include=[np.number]).columns:
    data_csv[cols] = winsorize(data_csv[cols], limits=[0.01, 0.01])

- **Part E: Feature Engineering**

- **8. Handle variable types:**
    - Mixed Variables (numeric + categorical).
    - Date & Time variables → extract Year, Month, Day, Weekday.

In [71]:
# Handle variable types:
# Date & Time extract year, month, day, weekday
data_csv['account_creation_date'] = pd.to_datetime(data_csv['join_date'])
data_csv['account_creation_year'] = data_csv['account_creation_date'].dt.year
data_csv['account_creation_month'] = data_csv['account_creation_date'].dt.month
data_csv['account_creation_day'] = data_csv['account_creation_date'].dt.day
data_csv['account_creation_weekday'] = data_csv['account_creation_date'].dt.weekday

- **9. Encoding categorical variables:**
    - Ordinal Encoding (education levels).
    - Label Encoding (binary features).
    - One-Hot Encoding (regions, loan purpose).

In [72]:
# Encoding categorical variables
# Ordinal Encoding(education levels)
education_mapping = {'High School': 1, 'Bachelor': 2, 'Master': 3, 'PhD': 4}
data_csv['education_level_encoded'] = data_csv['education_level'].map(education_mapping)

# label encoding(binary features)
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
data_csv['gender_encoded'] = label_encoder.fit_transform(data_csv['gender'])

# One-Hot encoding(regions, loan purpose)
data_csv = pd.get_dummies(data_csv, columns=['region', 'loan_purpose'], drop_first=True)

- **10. Encoding numerical features:**
    - Binning (discretize income into groups).
    - Binarization (flag if > threshold).
    - Quantile Binning.
    - K-Means Binning.

In [73]:
# Encoding numerical features 
# Binning (discretization income into groups)
data_csv['income_bin'] = pd.cut(data_csv['annual_income'], bins=[0, 30000, 60000, 90000, 120000, np.inf], labels=['Very Low', 'Low', 'Medium', 'High', 'Very High'])

# Binarization(flag if > threshold)
from sklearn.preprocessing import Binarizer
binarizer = Binarizer(threshold=50000)
data_csv['high_income_flag'] = binarizer.fit_transform(data_csv[['annual_income']])

# Qunatile binning
data_csv['income_quantile'] = pd.qcut(data_csv['annual_income'], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])

# K-means binning
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=4, random_state=0)
data_csv['income_kmeans_bin'] = kmeans.fit_predict(data_csv[['annual_income']])

- **Part F: Feature Scaling**

- **11. Apply multiple scaling methods:**
    - Standardization (Z-score scaling).
    - Normalization.
    - Min-Max Scaling.
    - MaxAbs Scaling.
    - Robust Scaling.

In [74]:
# Applying multiple scaling methods
# Standardization(Z-score scaling)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data_csv[['age', 'annual_income', 'credit_score']] = scaler.fit_transform(data_csv[['age', 'annual_income', 'credit_score']])

In [75]:
# Normalization
from sklearn.preprocessing import Normalizer
scaler = Normalizer()
data_csv[['age', 'annual_income', 'credit_score']] = scaler.fit_transform(data_csv[['age', 'annual_income', 'credit_score']])

In [76]:
# Min-Max Scaling
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
data_csv[['age', 'annual_income', 'credit_score']] = scaler.fit_transform(data_csv[['age', 'annual_income', 'credit_score']])

In [77]:
# Max Abs Scaling
from sklearn.preprocessing import MaxAbsScaler
scaler = MaxAbsScaler()
data_csv[['age', 'annual_income', 'credit_score']] = scaler.fit_transform(data_csv[['age', 'annual_income', 'credit_score']])

In [78]:
# Robust Scaling
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
data_csv[['age', 'annual_income', 'credit_score']] = scaler.fit_transform(data_csv[['age', 'annual_income', 'credit_score']])

- **Part G: Feature Construction & Transformation**

- **12. Apply transformations:**
    - FunctionTransformer → log transform, reciprocal, square root.
    - PowerTransformer → Box-Cox and Yeo-Johnson.
    - Column Transformer → apply different preprocessing steps to different columns.

In [87]:
# Applying Functional transformer and Power transformer
# Defining the transformers
from sklearn.preprocessing import FunctionTransformer, PowerTransformer
log_transformer = FunctionTransformer(np.log1p, validate=False)
reci_transformer = FunctionTransformer(lambda x: 1 / (x + 1), validate=False)
sqrt_transformer = FunctionTransformer(np.sqrt, validate=False)
boxcox_transformer = PowerTransformer(method='box-cox', standardize=False)
yeo_johnson_transformer = PowerTransformer(method='yeo-johnson', standardize=True)

- **13. Construct new features:**
    - Debt-to-Income ratio.
    - Average monthly transactions.
    - Spending-to-Income ratio.

In [88]:
# Contruct new features
# Debt to Income Ratio
data_csv['debt_to_income_ratio'] = data_csv['loan_amount'] / data_csv['annual_income']

# Average monthly transactions
data_csv['avg_monthly_transactions'] = data_csv['transaction_count'] / data_csv['account_creation_month']

# Spending to Income Ratio
data_csv['spending_to_income_ratio'] = data_csv['spending_ratio'] / data_csv['annual_income']

In [89]:
# Combining the transformers using ColumnTransformer
from sklearn.compose import ColumnTransformer
column_transformer = ColumnTransformer(
    transformers=[
        ('log', log_transformer, ['annual_income']),
        ('reci', reci_transformer, ['credit_score']),
        ('sqrt', sqrt_transformer, ['age']),
        ('boxcox', boxcox_transformer, ['loan_amount']),
        ('yeo_johnson', yeo_johnson_transformer, ['spending_ratio'])
    ],
    remainder='passthrough'
)

In [90]:
# Applying transformations
tranformed_data_csv = column_transformer.fit_transform(data_csv)

  result = func(self.values, **kwargs)


In [91]:
# Creating the new transformed DataFrame
final_df = pd.DataFrame(tranformed_data_csv, columns=data_csv.columns)

- **Part H: Final Deliverable**

- **14. Provide a final cleaned and transformed dataset.**

In [93]:
# Providing a final cleaned and transformed dataset
final_df.to_csv("cleaned_transformed_credit_dataset.csv", index=False)

print("Final cleaned dataset saved as 'cleaned_transformed_credit_dataset.csv'")
print("Final Shape: ", final_df.shape)

Final cleaned dataset saved as 'cleaned_transformed_credit_dataset.csv'
Final Shape:  (480, 35)


- **15. Write a report summarizing:**
    - Missing value strategies used and their effectiveness.
    - Outlier handling results.
    - Encoding methods applied to categorical/numerical variables.
    - Scaling transformations applied and why.
    - Newly engineered features and their usefulness.
    - Final dataset shape and readiness for ML modeling.

In [94]:
print("The report is ready and it is saved as in WORD file!")
print("Data processing and feature engineering completed successfully.")

The report is ready and it is saved as in WORD file!
Data processing and feature engineering completed successfully.
