# COGS 118A - Final Project

# Insert title here

## Group members

- Andy Chow
- Naomi Chin
- Andrew Lona
- Jiaqi Liu

# Abstract 

This project aims to build a model that identifies the underlying factors contributing to a person's financial health and how different demographics may affect it. We will use the UCI bank marketing data set to obtain demographic information. We will use a Support Vector Machine model and K Nearest Neighbors multiclass classification to predict bank account balances, with the possibility of using Principal Component Analysis to determine the most significant contributors to financial status. We will use confusion matrices, F1 score, ROC-AUC score, learning curves, and an AIC and BIC score comparison to evaluate model performance.

# Background

Making money and having a large bank account balance is a goal that many strive for. But what factors contribute to how much money people have in their accounts? Job type and age are obvious factors that typically affect salary and thus bank account balance, but what other attributes have an influence, and to what degree?

There have been studies on how different demographics affect salary or bank account balance. For example, an article by ValuePenguin breaks down how income and age affect balances. Unsurprisingly, as income increases, the average balance increases. In addition, balance increases with age up to the 65-74 age range, but for those 75 and older, the average balance decreases <a name="moon"></a>[<sup>[1]</sup>](#moonnote).
There are also a multitude of studies that have looked at how different personal attributes affect earnings. While earnings are not the same as bank account balance, we have already seen how income affects the balance. A report by Social Security found that men with bachelor's degrees earn about \\$900,000 more during their lifetime than men with only high school diplomas. For women, the difference is about \\$630,000. The report then took into account certain socio-demographic variables that could influence earnings; after recalculation, men and women with bachelor's degrees earn \\$655,000 and \\$450,000 more, respectively, than their high school graduate counterparts <a name="social"></a>[<sup>[2]</sup>](#socialnote). Further, Social Security looks into how savings are affected by marital status. It was found that married people were much more likely to have an individual retirement account (IRA) or defined contribution (DC) plan. This likely means that married people are better at saving money, and cost sharing, long-term commitment, and future-focused behavior may be contributors to this <a name="relationship"></a>[<sup>[3]</sup>](#relationshipnote).

We are going to analyze not only which attributes collectively contribute to bank account balance, but also how much influence each attribute has. The goal of our research and analysis is to find which demographics and lifestyle choices correlate with bank account balances. While the results of this research are intended to be informative, they could also serve as guidance for increasing a bank account balance. This information could also be used by banks to determine which clients to focus on.

# Problem Statement

The problem we hope to solve is how well a person’s demographic information can predict their financial health. In this instance, financial health is determined by the balance of a person’s bank account (high is good, low is bad). Although a single bank account balance cannot directly indicate a person’s overall financial health, it is a reasonable assumption that a higher bank account balance correlates with better financial health. Our analysis will first use a Principal Component Analysis (PCA) to determine which aspects of a person’s demographic information have the most influence on bank account balance. Then we will determine how well a person’s age, job type, marital status, education level, credit in default status, housing loan holder status, and loan holder status predict their bank account balance. To predict the balances, we will compare Support Vector Machine (SVM) and K Nearest Neighbors (KNN) models for multiclass classification.

# Data

The data that will be used is the UCI bank marketing data set. It gives information of direct marketing campaigns of a Portuguese banking institution.

Data link: https://archive.ics.uci.edu/ml/datasets/Bank+Marketing
- 17 variables, 45211 observations
- each observation consists of one marketing call from a Portuguese banking institution to assess whether the client would subscribe to a bank term deposit. These calls do not always result in a subscription, and follow-up calls are sometimes made to the same client.
- critical variables are
    - balance: at the moment we are acting on an assumption, but it should be the current account balance of the client contacted
    - independent variables: age (numerical), job type, marital status, education level, credit in default status, housing loan holder status, and loan holder status
- cleaning/transformations
    - the data is quite clean from a wrangling standpoint, although most of the categorical variables do have an unknown or nonexistent value which will need to be accounted for.
    - we will also need to one-hot encode all categorical IVs such as marital status and job type (not dummy coding).
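
The encoding step could look roughly like the sketch below. It is a minimal illustration, not our final pipeline: the toy rows are made up, the column names are assumed to follow the UCI documentation, and treating "unknown" as its own category (rather than imputing or dropping it) is just one possible choice.

```python
import pandas as pd

# Toy rows standing in for the real data. Column names are assumed to follow
# the UCI documentation; the values are made up for illustration.
toy = pd.DataFrame({
    "age": [34, 51, 28, 45],
    "job": ["technician", "unknown", "services", "management"],
    "marital": ["married", "single", "single", "divorced"],
    "education": ["secondary", "tertiary", "unknown", "tertiary"],
    "default": ["no", "no", "yes", "no"],
    "housing": ["yes", "no", "yes", "no"],
    "loan": ["no", "no", "yes", "no"],
})

categorical_cols = ["job", "marital", "education", "default", "housing", "loan"]


def one_hot_encode(df, columns):
    """One-hot encode `columns`, keeping every level (no dropped baseline)."""
    # drop_first=False keeps all levels, so this is one-hot rather than dummy
    # coding, and "unknown" simply becomes its own indicator column.
    return pd.get_dummies(df, columns=columns, drop_first=False, dtype=int)


encoded = one_hot_encode(toy, categorical_cols)
print(encoded.columns.tolist())
```

Keeping "unknown" as a separate indicator avoids discarding rows, at the cost of treating missingness as if it were a real category.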


# Proposed Solution

We are interested in predicting the bank account balance of individuals given their demographic data provided by the bank. Since each observation's account balance is independent of the other observations, we can convert the continuous balance into categorical bins based on quartile ranges, with an additional "below 0" bin.
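
As a rough sketch of this binning, the function below assigns negative balances to a "below 0" bin and splits the remaining balances at their quartiles. The label names (`Q1` to `Q4`) and the decision to compute quartiles on non-negative balances only are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical label names: one bin for negative balances plus four quartile bins.
BIN_NAMES = ["below 0", "Q1", "Q2", "Q3", "Q4"]


def bin_balance(balance):
    """Map balances to ordered classes: 'below 0' plus four quartile bins."""
    balance = np.asarray(balance, dtype=float)
    # Assumption: quartile edges are computed on the non-negative balances only.
    edges = np.quantile(balance[balance >= 0], [0.25, 0.5, 0.75])
    # searchsorted with side="left" gives right-inclusive bins: x <= edge -> lower bin.
    quartile_code = 1 + np.searchsorted(edges, balance, side="left")
    codes = np.where(balance < 0, 0, quartile_code)
    return pd.Categorical.from_codes(codes, categories=BIN_NAMES, ordered=True)


binned = bin_balance([-50, 0, 100, 200, 300, 400, 500, 600, 700])
print(list(binned))
```

In practice the quartile edges would be computed on the training split only and then reused for the test split, so that no information leaks from the held-out data.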


We will first perform a PCA to decorrelate the inputs to analyze the unique contributions from each variable. This will help us to determine what features are the most significant contributors to bank account balance.
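
A minimal sketch of this step with scikit-learn is shown below, using synthetic data in place of the encoded features. Standardizing before PCA is an assumption on our part. Note also that PCA does not look at the balance itself, so the loadings describe structure among the inputs and would still need to be related to the balance in a later step.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def fit_pca(X, n_components=None):
    """Standardize the features, then fit a PCA on them."""
    X_scaled = StandardScaler().fit_transform(X)
    return PCA(n_components=n_components).fit(X_scaled)


# Synthetic stand-in for the encoded features: three independent columns plus
# a fourth column that is almost a copy of the first.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))
X_demo = np.column_stack([base, base[:, 0] + 0.1 * rng.normal(size=200)])

pca = fit_pca(X_demo)
explained = pca.explained_variance_ratio_  # share of variance per component
loadings = pca.components_                 # rows: components, columns: original features

print(np.round(explained, 3))
print(np.round(np.abs(loadings[0]), 3))    # which features dominate the first component
```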


We will then use an SVM model to conduct the multiclass classification. In addition, preliminary observation of the target feature reveals that the mean is significantly smaller than the median, which indicates that there are outliers in the dataset. We will use ElasticNet regularization primarily to limit the influence of outliers; of its two penalty terms, the L1 term can shrink coefficients exactly to zero, which also helps reduce the 17 features present, while the L2 term shrinks the remaining coefficients. We will search over the $\alpha$ term of the ElasticNet regularization since we are not completely sure how much to penalize for outliers. We will also apply a simple model using K Nearest Neighbors multiclass classification, and apply a model comparison metric to evaluate performance. We will use a confusion matrix and ROC-AUC analysis for model benchmarking to compare the SVM and KNN models.
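
One possible way to set this up in scikit-learn is sketched below. It rests on several assumptions: the SVM is a linear one trained with `SGDClassifier(loss="hinge", penalty="elasticnet")`, the regularization strength `alpha` and the L1/L2 mix `l1_ratio` are chosen by a cross-validated grid search rather than gradient descent, and synthetic data from `make_classification` stands in for the bank data. The grid values are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic five-class data standing in for the binned balances.
X, y = make_classification(
    n_samples=600, n_features=17, n_informative=6,
    n_classes=5, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Linear SVM (hinge loss) with an ElasticNet penalty, trained by SGD.
svm = make_pipeline(
    StandardScaler(),
    SGDClassifier(loss="hinge", penalty="elasticnet", max_iter=2000, random_state=0),
)
param_grid = {
    "sgdclassifier__alpha": [1e-4, 1e-3, 1e-2],    # overall penalty strength
    "sgdclassifier__l1_ratio": [0.15, 0.5, 0.85],  # 0 = pure L2, 1 = pure L1
}
svm_search = GridSearchCV(svm, param_grid, scoring="f1_macro", cv=3)
svm_search.fit(X_train, y_train)

# Simple KNN baseline on the same standardized features.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)

svm_f1 = f1_score(y_test, svm_search.predict(X_test), average="macro")
knn_f1 = f1_score(y_test, knn.predict(X_test), average="macro")
print(svm_search.best_params_, round(svm_f1, 3), round(knn_f1, 3))
```

Both models sit in a pipeline with the scaler so that the scaling parameters are learned from the training folds only.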


We plan to use Pandas for data preprocessing and the modules provided in the SKLearn library to conduct our model implementation.

# Evaluation Metrics

Since we are dealing with a multiclass problem, in order to complete a general performance evaluation of models, we will convert our classes into multiple one-vs-rest classifications. For each instance, we will take each bank account bin as the positive class and all of the other bins as the negative class. So we will have n binary classifications where n is the number of bank account bins.


From there, we can create a confusion matrix for each one-vs-rest classification. The true and predicted classes will be used as inputs and the true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) will be the outputs of the confusion matrix. We can then aggregate the TP, TN, FP, and FN from all of the classifications.
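
The one-vs-rest confusion matrices and their aggregation could be computed as in the sketch below, which uses scikit-learn's `multilabel_confusion_matrix` on a small made-up set of labels with three classes.

```python
import numpy as np
from sklearn.metrics import multilabel_confusion_matrix

# Made-up true and predicted bins for eight observations and three classes.
y_true = np.array([0, 1, 2, 2, 1, 0, 2, 1])
y_pred = np.array([0, 2, 2, 2, 1, 0, 1, 1])

# One 2x2 matrix per class, each laid out as [[TN, FP], [FN, TP]].
per_class = multilabel_confusion_matrix(y_true, y_pred, labels=[0, 1, 2])

# Aggregate the counts over all one-vs-rest problems.
tn, fp, fn, tp = per_class.sum(axis=0).ravel()
print(tn, fp, fn, tp)  # 14 2 2 6
```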


With the aggregate values, we will calculate the F1 score for each model. The F1 score is chosen because neither false positives nor false negatives are particularly penalizing to the final performance of the model. The F1 score is a function of precision and recall which account for false positives and false negatives, respectively.


$$
\begin{aligned}
\text{Precision} &= \frac{TP}{TP+FP} \\
\text{Recall} &= \frac{TP}{TP+FN} \\
F1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision}+\text{Recall}}
\end{aligned}
$$
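
As a small worked example of these formulas, the helper below computes the F1 score from aggregated counts. The function name and the example counts (TP = 6, FP = 2, FN = 2) are illustrative only, and returning 0 when a denominator is zero is a convention we assume here.

```python
def f1_from_counts(tp, fp, fn):
    """F1 score from aggregated true positives, false positives, and false negatives."""
    # Assumed convention: a score of 0 whenever a denominator would be zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision = 6/8 = 0.75, Recall = 6/8 = 0.75, so F1 = 0.75.
print(f1_from_counts(tp=6, fp=2, fn=2))  # 0.75
```

Aggregating the counts before computing F1 corresponds to micro-averaging; computing F1 per bin and then averaging (macro-averaging) would weight every bin equally instead.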


We will also calculate a Receiver Operating Characteristic Area Under the Curve (ROC-AUC) score to compare the SVM and KNN models. The ROC-AUC is a good evaluation of a model because it measures how well the model can distinguish between positive and negative classes; in our case, how well it separates each balance bin from the rest in the multiclass classification. The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at different classification thresholds. The ROC-AUC score ranges from 0 to 1, where 0.5 corresponds to random guessing; the larger the score, the better the performance of the model.
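
A one-vs-rest ROC-AUC could be computed as in the sketch below, which fits a KNN classifier on synthetic three-class data and passes its class probabilities to scikit-learn's `roc_auc_score`. The data, the choice of `n_neighbors=15`, and the macro averaging are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic three-class data standing in for the binned balances.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

knn = KNeighborsClassifier(n_neighbors=15).fit(X_train, y_train)
proba = knn.predict_proba(X_test)  # one probability column per class

# One-vs-rest AUC per class, averaged with equal weight per class.
auc = roc_auc_score(y_test, proba, multi_class="ovr", average="macro")
print(round(auc, 3))
```

The multiclass form of `roc_auc_score` expects probabilities that sum to 1 per row. A hinge-loss SVM does not produce these directly, so it would need probability estimates (for example `SVC(probability=True)`) or per-bin AUCs computed from its decision scores.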


We will also use a learning curve to check whether the model is overfitting or underfitting. Finally, we will include an Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) score comparison to compare models. These scores measure the relative quality of a model for a given dataset; lower scores indicate a better trade-off between fit and complexity. The scores are calculated from the number of model parameters (k), the maximum value of the likelihood function (L), and the number of observations in the data (n).


$$
\begin{aligned}
AIC &= 2k - 2\ln(L) \\
BIC &= k\ln(n) - 2\ln(L)
\end{aligned}
$$
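
The two formulas translate directly into code, as in the sketch below. The functions take the maximized log-likelihood $\ln(L)$ as input; how to obtain a likelihood for an SVM or KNN model is left open here, and the example numbers are made up.

```python
import math


def aic(log_likelihood, k):
    """Akaike Information Criterion: 2k - 2 ln(L)."""
    return 2 * k - 2 * log_likelihood


def bic(log_likelihood, k, n):
    """Bayesian Information Criterion: k ln(n) - 2 ln(L)."""
    return k * math.log(n) - 2 * log_likelihood


# Made-up example: ln(L) = -100, k = 5 parameters, n = 1000 observations.
print(aic(-100.0, k=5))                    # 210.0
print(round(bic(-100.0, k=5, n=1000), 3))  # 234.539
```

Because $\ln(n) > 2$ once $n \geq 8$, BIC penalizes additional parameters more heavily than AIC on a dataset of this size.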

# Results

You may have done tons of work on this. Not all of it belongs here. 

Reports should have a __narrative__. Once you've looked through all your results over the quarter, decide on one main point and 2-4 secondary points you want us to understand. Include the detailed code and analysis results of those points only; you should spend more time/code/plots on your main point than the others.

If you went down any blind alleys that you later decided to not pursue, please don't abuse the TAs' time by throwing in 81 lines of code and 4 plots related to something you actually abandoned.  Consider deleting things that are not important to your narrative.  If it's slightly relevant to the narrative or you just want us to know you tried something, you could keep it in by summarizing the result in this report in a sentence or two, moving the actual analysis to another file in your repo, and providing us a link to that file.

### Subsection 1

You will likely have different subsections as you go through your report. For instance you might start with an analysis of the dataset/problem and from there you might be able to draw out the kinds of algorithms that are / aren't appropriate to tackle the solution.  Or something else completely if this isn't the way your project works.

### Subsection 2

Another likely section is if you are doing any feature selection through cross-validation or hand-design/validation of features/transformations of the data

### Subsection 3

Probably you need to describe the base model and demonstrate its performance.  Maybe you include a learning curve to show whether you have enough data to do train/validate/test split or have to go to k-folds or LOOCV or ???

### Subsection 4

Perhaps some exploration of the model selection (hyper-parameters) or algorithm selection task. Validation curves, plots showing the variability of performance across folds of the cross-validation, etc. If you're doing one, the outcome of the null hypothesis test or parsimony principle check to show how you are selecting the best model.

### Subsection 5 

Maybe you do model selection again, but using a different kind of metric than before?



# Discussion

### Interpreting the result

OK, you've given us quite a bit of tech information above, now it's time to tell us what to pay attention to in all that.  Think clearly about your results, decide on one main point and 2-4 secondary points you want us to understand. Highlight HOW your results support those points.  You probably want 2-5 sentences per point.

### Limitations

Are there any problems with the work?  For instance would more data change the nature of the problem? Would it be good to explore more hyperparams than you had time for?   

### Ethics & Privacy

The main concern when making predictions from real-life data is the accuracy of our models. We are uncertain about the predictive power of the models and whether they should be used to help with future marketing. One could misuse the trained models and draw possibly inaccurate conclusions about bank marketing.

Furthermore, the data come from a single Portuguese banking institution and may not resemble the marketing data of banking institutions in other countries. The marketing data of other banking institutions in Portugal might also differ. The model we create might not accurately predict marketing results for other banking institutions.

Therefore, we are careful to disclose that the models we generate are for reference only, and we by no means guarantee that they are accurate in predicting the market reception of banking products. One needs to be cautious when using the trained models to predict patterns, generalize, and draw conclusions about bank marketing.

### Conclusion

Reiterate your main point and in just a few sentences tell us how your results support it. Mention how this work would fit in the background/context of other work in this field if you can. Suggest directions for future work if you want to.

# Footnotes
<a name="moonnote"></a>1.[^](#moon): Moon, Chris (14 Sept 2022) Average U.S. Checking Account Balance: A Demographic Breakdown. *ValuePenguin*. https://www.valuepenguin.com/banking/average-checking-account-balance.<br>
<a name="socialnote"></a>2.[^](#social): Social Security. Research, Statistics & Policy Analysis: Education and Lifetime Earnings. https://www.ssa.gov/policy/docs/research-summaries/education-earnings.html#:~:text=Men%20with%20bachelor's%20degrees%20earn,earnings%20than%20high%20school%20graduates.<br>
<a name="relationshipnote"></a>3.[^](#relationship): Social Security. Research, Statistics & Policy Analysis: The Relationship Between Retirement Savings and Marital Status Among Young Adults. https://www.ssa.gov/policy/docs/research-summaries/marital-status.html.<br>
