# Bank Deposit Subscription Prediction

## 1. Project Summary

This project analyzes and models the **Bank Marketing Dataset** from a Portuguese banking institution. The goal is to predict whether a customer will subscribe to a term deposit, using the customer's demographic, financial, and behavioral data.

Because of the dataset's size, **PySpark** is used for scalable data processing. Classification is done with **XGBoost** and several other machine learning models, chosen for their strong predictive performance.

### Project Goals
- Segment customers using demographic and behavioral features.
- Identify which customer groups are most likely to subscribe to a term deposit.
- Improve marketing efficiency and customer targeting.
- Generate business insights through data mining and predictive modeling.
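
As a rough illustration of the second goal, the sketch below ranks customer groups by their subscription rate. It uses only the Python standard library and a handful of toy rows invented for this example (they are not taken from the dataset); the helper names `subscription_rate_by_group` and `rank_groups` are hypothetical and not part of the project code. On the real data the same aggregation would more likely be done with a PySpark `groupBy`.

```python
from collections import defaultdict


def subscription_rate_by_group(rows, group_key, target_key="y"):
    """Return {group: share of rows in that group whose target is 'yes'}."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for row in rows:
        group = row[group_key]
        totals[group] += 1
        if row[target_key] == "yes":
            positives[group] += 1
    return {group: positives[group] / totals[group] for group in totals}


def rank_groups(rates):
    """Sort (group, rate) pairs from highest to lowest subscription rate."""
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


# Toy rows, invented for illustration only; NOT real dataset records.
toy_rows = [
    {"job": "student", "y": "yes"},
    {"job": "student", "y": "no"},
    {"job": "retired", "y": "yes"},
    {"job": "retired", "y": "yes"},
    {"job": "technician", "y": "no"},
    {"job": "technician", "y": "no"},
]

rates = subscription_rate_by_group(toy_rows, "job")
for job, rate in rank_groups(rates):
    print(f"{job}: {rate:.0%}")
```

Grouping by a different column (for example `marital` or `education`) only requires changing the `group_key` argument.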

### Features in the Dataset
- **Age**: Customer age.
- **Job**: Type of job (e.g., management, technician, retired, student, etc.).
- **Marital**: Marital status.
- **Education**: Education level.
- **Default**: Has credit in default? (yes/no).
- **Balance**: Average yearly account balance.
- **Housing**: Has housing loan? (yes/no).
- **Loan**: Has personal loan? (yes/no).
- **Contact**: Contact communication type.
- **Day**: Day of the month of the last contact.
- **Month**: Last contact month.
- **Duration**: Duration of last contact (in seconds).
- **Campaign**: Number of contacts during this campaign.
- **Pdays**: Number of days since the client was last contacted in a previous campaign (-1 means the client was not previously contacted).
- **Previous**: Number of contacts with this client before the current campaign.
- **Poutcome**: Outcome of the previous marketing campaign.
- **y**: Target variable – whether the client subscribed to a term deposit (yes/no).
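
The feature list above can be written down as an explicit schema and used to type-check rows as they are read. The following is a minimal sketch using only the standard library. It assumes lowercase column names in the order listed, integer values for the numeric columns, and a semicolon-delimited file with a header row; none of these are confirmed by this README. The `SCHEMA` constant, the `parse_rows` helper, and the sample row are hypothetical and for illustration only.

```python
import csv
import io

# Column name -> Python type, following the feature list above.
# ASSUMPTION: lowercase names, this column order, integer numeric columns.
SCHEMA = [
    ("age", int), ("job", str), ("marital", str), ("education", str),
    ("default", str), ("balance", int), ("housing", str), ("loan", str),
    ("contact", str), ("day", int), ("month", str), ("duration", int),
    ("campaign", int), ("pdays", int), ("previous", int),
    ("poutcome", str), ("y", str),
]


def parse_rows(text, delimiter=";"):
    """Parse delimited text with a header row into typed dictionaries."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    expected = [name for name, _ in SCHEMA]
    if reader.fieldnames != expected:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    return [{name: cast(raw[name]) for name, cast in SCHEMA} for raw in reader]


# One invented row for illustration only; NOT a real dataset record.
sample = (
    "age;job;marital;education;default;balance;housing;loan;contact;"
    "day;month;duration;campaign;pdays;previous;poutcome;y\n"
    "35;technician;married;secondary;no;1200;yes;no;cellular;"
    "14;may;180;2;-1;0;unknown;no\n"
)

rows = parse_rows(sample)
print(rows[0]["age"], rows[0]["job"], rows[0]["y"])
```

The same name-to-type mapping could be translated into a PySpark `StructType` so that Spark does not have to infer column types when loading the file.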


## 2. Instructions to Run the Code

### Required Libraries

Make sure the following libraries are installed. The code imports them as shown:

```python
# Standard library
import os
import re

# Data handling and statistics
import pandas as pd
from scipy import stats
from scipy.stats import kstest

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns

# PySpark: session, column functions, types, and ML feature pipeline
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import col, count, countDistinct, desc, first
from pyspark.sql.types import (
    ArrayType, DoubleType, IntegerType, StringType, StructField, StructType,
)
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder

# scikit-learn and XGBoost models and metrics
import sklearn as sk
from sklearn import model_selection, ensemble, metrics, tree
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier, RandomForestClassifier
import xgboost as xgb
from xgboost import XGBClassifier
```

Install any missing packages:

```bash
pip install pyspark pandas scipy seaborn matplotlib scikit-learn xgboost
```
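
To find out which of these packages are actually missing before running the code, a small check like the one below may help. It is a sketch using only the standard library; the `REQUIRED` mapping and the `missing_packages` helper are hypothetical names introduced here, not part of the project. Note that the pip name and the import name differ for scikit-learn (`sklearn`).

```python
import importlib.util

# pip distribution name -> top-level import name.
REQUIRED = {
    "pyspark": "pyspark",
    "pandas": "pandas",
    "scipy": "scipy",
    "seaborn": "seaborn",
    "matplotlib": "matplotlib",
    "scikit-learn": "sklearn",
    "xgboost": "xgboost",
}


def missing_packages(required):
    """Return the pip names whose import name cannot be found."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages; install with:")
        print("pip install " + " ".join(missing))
    else:
        print("All required packages are importable.")
```

The check only confirms that each package can be located; it does not verify versions or import the packages.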



## 3. Team Contribution

| Member Name        | Responsibilities                               | Contribution (%) |
|--------------------|------------------------------------------------|------------------|
| Lê Ngọc Mai        | README, Slides, Report Writing                 | 14.28%           |
| Nguyễn Nhật Hồng   | Modeling, Slides, Code Integration             | 14.28%           |
| Trần Quỳnh Trang   | Modeling, Slides, Code Integration             | 14.28%           |
| Phạm Thị Thảo      | Data Processing, EDA, Code Integration, README | 14.28%           |
| Phí Đình Mạnh      | Dashboard, Slides, Report Writing              | 14.28%           |
| Đỗ Phương Dung     | EDA, Slides                                    | 14.28%           |
| Lều Ngọc Minh      | EDA, Slides                                    | 14.28%           |
