# Applying Data Science Project

## Literature Review

1. Machine Learning approaches

- Random forest
  Random forest can capture complex patterns in data and has previously been used in credit scoring models:
  - "A Comparison of Random Forest and Logistic Regression Models in Credit Scoring of Rural Households" - A study evaluating the performance of logistic regression and random forest models for credit scoring of rural Vietnamese households. The findings indicate that the random forest model outperforms logistic regression in predictive accuracy. The study suggests that banks could use random forest to predict credit risk from their existing client data, saving the time and cost of finding potential clients. (https://www.researchgate.net/publication/353326832_A_COMPARISON_OF_RANDOM_FOREST_AND_LOGISTIC_REGRESSION_MODEL_IN_CREDIT_SCORING_OF_RURAL_HOUSEHOLDS)


- Overview of methods
  - "Machine Learning for Credit Risk Prediction: A Systematic Literature Review" - This review examines various machine learning methods applied to credit risk prediction. It highlights the effectiveness of deep learning models and ensemble methods over traditional statistical approaches. It also discusses the need for consistent datasets. (doi = 10.20944/preprints202308.0947.v1)
  - "The Impact of Feature Selection and Transformation on Machine Learning Methods in Determining the Credit Scoring" - This paper examines how feature selection and data transformation influence the performance of machine learning models in credit scoring. It provides an extensive comparison of eight machine learning methods, analyzing their accuracy and the effects of different preprocessing techniques. 
    Main Findings:
     1. Model Performance with Feature Selection:
     - Incorporating feature selection methods enhances the predictive accuracy of machine learning models in assessing default risk.
     2. Impact of Data Scaling:
     - Combining appropriate data scaling methods with feature selection significantly improves model performance.
     - Proper scaling ensures that features contribute proportionately, preventing dominance by features with larger numerical ranges.
     3. Optimal Combinations:
     - Models like XGBoost and Random Forest, when paired with tailored feature selection and scaling, achieved notable performance improvements.
     4. Top-Performing Models:
     - Among the evaluated models, XGBoost and Random Forest consistently outperformed others in predicting default risk.
     - These models, when combined with tailored feature selection and scaling methods, achieved the best validation indicators, including accuracy rate, Type I and II errors, and Area Under the Curve (AUC).

    Scaling techniques used:
     - Standardization / Z-score normalization: transforms data to have a mean of zero and a standard deviation of one, so that features contribute on a comparable scale.
     - Min-Max Scaling: scales data to a specified range, typically [0, 1], preserving the relative spacing of values while normalizing their scales.
     - Robust Scaling: uses statistics that are robust to outliers, such as the median and interquartile range, making the scaled data less sensitive to extreme values.

    (doi = 10.48550/arXiv.2303.05427)
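
To make the three scaling techniques described above concrete, here is a minimal sketch in plain Python. The function names and the sample values are hypothetical and are not taken from the paper. It assumes a single numeric feature that is not constant (otherwise the denominators would be zero).

```python
import statistics


def standardize(values):
    """Z-score: subtract the mean, divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def robust_scale(values):
    """Centre on the median and divide by the interquartile range (IQR)."""
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in values]


# Hypothetical feature values with one extreme outlier (illustration only).
limits = [10, 20, 30, 40, 1000]

z_scores = standardize(limits)
unit_range = min_max_scale(limits)
robust = robust_scale(limits)
```

With min-max scaling the outlier squeezes the other four values close to 0, whereas robust scaling keeps them evenly spread around 0 because the median and IQR ignore the extreme value.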


Paper using the Taiwan dataset (from GitHub)

Data Quality Assessment:
  - The author examined the data for erroneous values, missing values, and outliers

Outlier Treatment:
The paper notes significant variability in the Taiwan dataset variables, necessitating outlier treatment. 
Several approaches were discussed:
  - Min-Max Normalization: Compressing data into a range between 0 and 1 using the equation $x'_i = \frac{x_i - \min(x)}{\max(x) - \min(x)}$
  - Z-score Standardization: Converting variables to standard normal with mean 0, standard deviation 1
  - Winsorization: Capping extreme values at specific percentiles (e.g., 1st and 99th)
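
As an illustration of the winsorization approach listed above, the following is a minimal sketch in plain Python. The `percentile` helper, the linear interpolation rule, and the sample data are assumptions made for this example; the paper's exact procedure may differ.

```python
def percentile(sorted_values, q):
    """Linearly interpolated percentile (q in [0, 100]) of pre-sorted data."""
    pos = (len(sorted_values) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] * (1 - frac) + sorted_values[upper] * frac


def winsorize(values, lower_pct=1, upper_pct=99):
    """Cap values below/above the given percentiles at those percentiles."""
    ordered = sorted(values)
    lo = percentile(ordered, lower_pct)
    hi = percentile(ordered, upper_pct)
    return [min(max(v, lo), hi) for v in values]


# Hypothetical values 1..101 (illustration only).
amounts = list(range(1, 102))
capped = winsorize(amounts)
```

Unlike min-max or z-score scaling, winsorization changes only the extreme observations and leaves the rest of the values untouched.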

Analysis Steps:
  - The author performed variable selection using Weight of Evidence (WoE) and Information Value (IV) measures to determine which variables had predictive power
  - The dataset was split into training (70%) and testing (30%) samples
  - Multiple model types were tested, including logistic regression, random forest, gradient boosting, neural networks, and support vector machines
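
The variable-selection step above can be sketched as follows. This is a minimal illustration of one common WoE/IV convention (log of the share of non-defaulters over the share of defaulters per category), not necessarily the one used by the author. The function name `woe_iv`, the `smoothing` parameter (added here to avoid division by zero in empty cells), and the sample data are all hypothetical.

```python
import math
from collections import Counter


def woe_iv(categories, defaults, smoothing=0.5):
    """Return ({category: WoE}, IV) for one categorical or pre-binned variable.

    defaults: 1 = default ("bad"), 0 = no default ("good").
    """
    good, bad = Counter(), Counter()
    for cat, d in zip(categories, defaults):
        if d == 1:
            bad[cat] += 1
        else:
            good[cat] += 1

    cats = set(good) | set(bad)
    total_good = sum(good.values()) + smoothing * len(cats)
    total_bad = sum(bad.values()) + smoothing * len(cats)

    woe, iv = {}, 0.0
    for cat in cats:
        dist_good = (good[cat] + smoothing) / total_good
        dist_bad = (bad[cat] + smoothing) / total_bad
        woe[cat] = math.log(dist_good / dist_bad)
        iv += (dist_good - dist_bad) * woe[cat]
    return woe, iv


# Hypothetical payment-status variable: "late" borrowers default more often.
status = ["late"] * 4 + ["on_time"] * 6
default = [1, 1, 1, 0] + [1, 0, 0, 0, 0, 0]
woe, iv = woe_iv(status, default)
```

Under this convention a negative WoE marks a category with a higher-than-average share of defaulters, and a larger IV indicates a variable with more predictive power.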

Results:
  - For the Taiwan dataset, gradient boosting performed best with a ROC AUC of 76.7% and PR AUC of 63.4%
  - The author found that the most important variable in predicting default was PAY_0 (recent payment status) across different modeling methods

The Taiwan dataset was one of three used in the study to show that machine learning models generally outperform traditional logistic regression for credit card default prediction, though the degree of improvement varied across datasets.
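
For reference when reading the ROC AUC figures above: ROC AUC can be interpreted as the probability that a randomly chosen defaulter receives a higher predicted score than a randomly chosen non-defaulter. The sketch below computes it directly from that definition in plain Python. The function name and the sample labels and scores are hypothetical and unrelated to the paper's results.

```python
def roc_auc(labels, scores):
    """ROC AUC via pairwise comparison of positive and negative scores.

    labels: 1 = default, 0 = no default. Ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels and model scores (illustration only).
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The pairwise loop is quadratic in the number of observations, so it suits small examples only; library implementations use a sort-based computation instead.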

# Datasets

## Environmental datasets

1. Climate Change Knowledge Portal (CCKP) – World Bank

Description: The CCKP offers comprehensive environmental, disaster risk, and socio-economic datasets. It provides synthesis products like Climate Adaptation Country Profiles, which can help assess environmental risks affecting economic stability. It is possible to download data by country. https://climateknowledgeportal.worldbank.org/download-data

Example datasets: population density (global), percentage of population below $1.90/day (global/local)

How it can be used: These indicators can flag areas of poverty, where credit risk may be higher.

2. Socio-Economic Status & Poverty Levels

Data Sources:
- World Bank Open Data (Poverty, GDP per capita, Unemployment rates) https://data.worldbank.org/
- UNDP Human Development Index (HDI) - based on a combination of life expectancy, education & GNI (Gross National Income) per capita. https://hdr.undp.org/data-center/human-development-index#/indicies/HDI

How it can be used: 
- Map borrowers to regions with poverty indicators or low GDP.
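
A minimal sketch of this mapping step, assuming each borrower record carries a region code that can be matched to a table of regional indicators. All field names, region codes, and values below are hypothetical placeholders, not real World Bank or UNDP data.

```python
# Hypothetical regional indicators keyed by region code (placeholder values).
region_indicators = {
    "R1": {"poverty_rate": 0.25, "gdp_per_capita": 4000.0},
    "R2": {"poverty_rate": 0.05, "gdp_per_capita": 30000.0},
}

# Hypothetical borrower records.
borrowers = [
    {"id": 1, "region": "R1", "income": 12000.0},
    {"id": 2, "region": "R2", "income": 45000.0},
    {"id": 3, "region": "R9", "income": 20000.0},  # region without indicator data
]


def add_region_features(borrowers, indicators):
    """Left-join regional indicators onto borrowers; missing regions yield None."""
    enriched = []
    for borrower in borrowers:
        regional = indicators.get(borrower["region"], {})
        row = dict(borrower)  # copy so the input records are not modified
        row["poverty_rate"] = regional.get("poverty_rate")
        row["gdp_per_capita"] = regional.get("gdp_per_capita")
        enriched.append(row)
    return enriched


enriched = add_region_features(borrowers, region_indicators)
```

Keeping unmatched borrowers with `None` values, rather than dropping them, makes gaps in regional coverage visible before modelling.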

3. Disaster Risk & Exposure

Data Sources:

- EM-DAT (International Disaster Database) - contains data on the occurrence and impacts of over 26,000 mass disasters worldwide from 1900 to the present day. Data includes climatological (drought, wildfire), geophysical (earthquakes, volcanic activity), and hydrological data (floods, mass movement), all of which can be sorted by country or continent. https://www.emdat.be/

How it can be used:
- Customers located in disaster-prone regions (e.g., frequent flooding, hurricanes) may carry higher credit risk, as natural disasters can cause financial instability or displacement.

4. Housing & Living Costs

Data Sources:
- OECD Regional Statistics - contains a variety of data, including housing costs as a % of disposable income, ownership rates, and living standards. https://data-explorer.oecd.org/?lc=en
- Numbeo (cost of living data) - includes a cost of living index by city. https://www.numbeo.com/cost-of-living/country_result.jsp?country=United+Kingdom

How it can be used:
- Factor in regional cost of living to assess debt-to-income ratios more accurately.
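
One possible way to do this is sketched below. It assumes a cost of living index where 100 is the reference level, and deflates income by that index before computing the ratio. This adjustment is an illustrative assumption, not an established formula from any of the sources above, and the function name and figures are hypothetical.

```python
def adjusted_dti(monthly_debt, monthly_income, cost_of_living_index):
    """Debt-to-income ratio with income deflated by a regional cost index.

    cost_of_living_index: 100 = reference level (assumed convention);
    values above 100 mean the same income buys less in that region.
    """
    if monthly_income <= 0 or cost_of_living_index <= 0:
        raise ValueError("income and index must be positive")
    real_income = monthly_income * 100 / cost_of_living_index
    return monthly_debt / real_income


# Hypothetical borrower: 500 debt payments, 2000 income, index of 125.
raw_dti = 500 / 2000
dti = adjusted_dti(500, 2000, 125)
```

In this example the raw ratio of 0.25 rises once the higher regional living costs are taken into account.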
