________________________________________________________________________________

# **Group Assignment**
________________________________________________________________________________

## Context and understanding of the problem

- Credit risk is the probability of a financial loss resulting from a borrower's **failure to repay a loan**. In other words, it is the potential loss that a lender or investor may incur when a borrower or counterparty fails **to fulfill their financial obligations**.



- For example, imagine you lend $500 to a friend to buy a new phone, and your friend promises to pay you back in a month. When the due date arrives, your friend does not have the money to repay you. This is credit risk: the risk that your friend cannot meet the repayment obligation, resulting in a loss for you as the lender.

- Credit risk covers the chance that a lender or investor does not receive the principal and interest owed because the borrower is unable to repay the loan. Even a delayed repayment carries a cost, because of the time value of money.

- IFRS 9 is an International Financial Reporting Standard issued by the International Accounting Standards Board (IASB) that addresses the accounting for financial instruments. Its final version was issued in July 2014 and became effective for annual periods beginning on or after 1 January 2018. It provides guidance on the classification, measurement, impairment, and hedge accounting of financial instruments, including those related to credit risk. This Standard replaces IAS 39 Financial Instruments: Recognition and Measurement.

- IFRS 9 requires entities to use ‘Expected Credit Losses’ (ECLs) to recognize impairment losses on financial assets, whereas IAS 39 required entities to use ‘Incurred Losses’. Using ECLs lets entities recognize losses earlier and can lead to more timely and accurate recognition of credit losses. Financial entities therefore have to develop models that predict expected losses and make provisions against the credit risk that results from default. This has made the calculation of the probability of default a crucial component in the financial field.
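To make the expected-loss idea concrete, the sketch below uses a common textbook decomposition that is not stated in the text above and should be read as an assumption: expected credit loss as the product of the probability of default (PD), the loss given default (LGD) and the exposure at default (EAD). The numbers are made up for illustration.

```python
def expected_credit_loss(pd_, lgd, ead):
    """ECL under the PD x LGD x EAD decomposition (an assumption, see above)."""
    if not 0.0 <= pd_ <= 1.0:
        raise ValueError("PD must be between 0 and 1")
    if not 0.0 <= lgd <= 1.0:
        raise ValueError("LGD must be between 0 and 1")
    return pd_ * lgd * ead

# Hypothetical loan: 5% chance of default, 60% of the exposure lost
# if default happens, and $500 outstanding at the time of default.
ecl = expected_credit_loss(pd_=0.05, lgd=0.60, ead=500.0)
print(f"Expected credit loss: ${ecl:.2f}")
```

Under this decomposition the provision scales linearly with PD, which is why an inaccurate PD estimate feeds directly into an inaccurate provision.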





### Probability of default  

- The probability of default (PD) is one of the key metrics used to assess the creditworthiness of a customer. It helps us estimate how likely a customer is to make payments on time and remain solvent over the term of the loan.

- According to Genesis (2018), the probability of default is one of the most important risk parameters estimated in credit institutions, especially banks, and plays a major role in credit risk analysis and management. Because granting loans is one of the fundamental activities of banks, the banking industry places a great deal of emphasis on credit risk. Probability of default estimates should also be unbiased, meaning that the best-estimate PD should accurately predict the number of defaults. It is therefore crucial for financial entities to produce accurate default probability estimates, so that they can assess the chance of financial loss and provision for it. This research implements several classifiers, compares their performance against a baseline logistic regression model, and evaluates how well each one separates defaulting from non-defaulting clients in the given retail dataset.
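One simple way to check the unbiasedness requirement described above is to compare the number of defaults a model expects (the sum of its predicted PDs) with the number actually observed. The sketch below uses simulated borrowers rather than the project's data, and the helper name `expected_vs_observed` is purely illustrative.

```python
import random

def expected_vs_observed(pds, outcomes):
    """Return (expected defaults, observed defaults) for a portfolio.

    pds      -- predicted probabilities of default, one per loan
    outcomes -- 1 if the loan defaulted, 0 otherwise
    """
    if len(pds) != len(outcomes):
        raise ValueError("pds and outcomes must have the same length")
    return sum(pds), sum(outcomes)

# Simulated portfolio: each loan defaults with exactly its predicted PD,
# so the predictions are unbiased by construction.
rng = random.Random(42)
pds = [rng.uniform(0.01, 0.30) for _ in range(10_000)]
outcomes = [1 if rng.random() < p else 0 for p in pds]

expected, observed = expected_vs_observed(pds, outcomes)
print(f"Expected defaults: {expected:.1f}, observed defaults: {observed}")
```

For a real model, a large gap between the two numbers on held-out data would suggest that the PDs are systematically too high or too low, even if the model ranks borrowers well.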

## Explanation of the data, data source and features

- Historical loan data from a Zimbabwean microfinance company was used to develop the model. The data covers the 2023 financial year.
- The data consists of **21 columns** and **100 000** loan records. The target variable, **Loan Status**, distinguishes defaults from non-defaults.
- The name of the company is not disclosed, in order to protect client information.
- The borrower features that may affect the probability of default are explained below:



| Column                         | Description                                                                                                    |
|--------------------------------:|----------------------------------------------------------------------------------------------------------------|
| loan_id                        | Unique identifier for each loan application or record.                                                          |
| gender                         | Borrower's gender, such as "Male," "Female," or "other".                                                        |
| disbursemet_date               | The date the loan was disbursed, stored as a string (e.g., "YYYY-MM-DD").                                       |
| currency                       | Currency in which the loan is denominated (e.g., USD).                                                          |
| country                        | Borrower's country of residence.                                                                                |
| sex                            | Another representation of the borrower's sex, possibly duplicating `gender`.                                    |
| is_employed                    | Indicates if the borrower is employed (True or False).                                                          |
| job                            | Borrower's job or profession.                                                                                   |
| location                       | Borrower's location                                                                                             |
| loan_amount                    | Amount of the loan disbursed                                                                                    |
| number_of_defaults             | Number of times the borrower has defaulted on loans in the past.                                                |
| outstanding_balance            | Remaining balance on the loan                                                                                   |
| interest_rate                  | Interest rate of the loan.                                                                                      |
| age                            | Borrower's age in years.                                                                                        |
| number_of_defaults.1           | Likely a duplicate of number_of_defaults.                                                                       |
| remaining term                 | Remaining time for loan repayment.                                                                              |
| salary                         | Borrower's salary.                                                                                              |
| marital_status                 | Borrower's marital status expressed as "Single," "Married," or "Divorced."                                      |
| age.1                          | Another representation of age, possibly duplicating age.                                                        |
| Loan Status                    | Current status of the loan (defaulted or did not default)                                                       |
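
The table flags three columns as possible duplicates (`sex`, `number_of_defaults.1` and `age.1`). A minimal sketch of how that suspicion could be verified before dropping anything is shown below. It runs on a tiny made-up frame rather than the real data, and the helper name `drop_confirmed_duplicates` is hypothetical.

```python
import pandas as pd

def drop_confirmed_duplicates(df, pairs):
    """Drop the second column of each (original, suspect) pair, but only
    when both columns exist and hold identical values."""
    to_drop = [
        suspect
        for original, suspect in pairs
        if original in df.columns
        and suspect in df.columns
        and df[original].equals(df[suspect])
    ]
    return df.drop(columns=to_drop), to_drop

# Toy data for illustration only: `age.1` matches `age`,
# while `sex` differs from `gender` in the last row.
toy = pd.DataFrame({
    "gender": ["female", "male", "other"],
    "sex": ["female", "male", "male"],
    "age": [34, 51, 28],
    "age.1": [34, 51, 28],
})
pairs = [
    ("gender", "sex"),
    ("age", "age.1"),
    ("number_of_defaults", "number_of_defaults.1"),
]

cleaned, dropped = drop_confirmed_duplicates(toy, pairs)
print("Dropped:", dropped)
print("Remaining columns:", list(cleaned.columns))
```

Checking equality first avoids silently discarding a column that only looks like a duplicate by name, such as a `sex` column coded differently from `gender`.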


## Importing libraries

In [2]:
#Data manipulation
import pandas as pd
import numpy as np

#Removing unnecessary warnings
import warnings
warnings.filterwarnings("ignore")

#Exploratory Data Analysis and Plotting
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use('ggplot')
sns.set(style="white")


#Machine learning
from sklearn.model_selection import train_test_split
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.formula.api as smf
import statsmodels.api as sm
from sklearn.preprocessing import LabelEncoder
from sklearn import preprocessing
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier


#Saving the model
from joblib import dump, load



#Model evaluation
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    classification_report, precision_recall_fscore_support,
    roc_curve, roc_auc_score, precision_recall_curve, brier_score_loss,
    confusion_matrix, ConfusionMatrixDisplay,
)
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

## Loading data 

#Connecting Google Drive to the Colab notebook (only works inside Google Colab)
from google.colab import drive
drive.mount('/content/drive')

#Loading data from Drive: replace the placeholder with the path of the dataset
#df = pd.read_csv('<path of dataset from drive>')

In [3]:
df = pd.read_csv('credit risk zimbabwe.csv')
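
Because `disbursemet_date` is stored as a string, it may be convenient to parse it while loading. The sketch below reads a two-row in-memory CSV with made-up values (not the real file, and the `Loan Status` labels are invented) so that it runs anywhere. With the real file, the same `parse_dates` argument would be passed alongside the file name.

```python
import io
import pandas as pd

# Made-up sample standing in for the real CSV; only a few columns are shown.
sample_csv = io.StringIO(
    "loan_id,disbursemet_date,loan_amount,Loan Status\n"
    "1,2023-01-15,500.0,Did not default\n"
    "2,2023-03-02,1200.0,Defaulted\n"
)

# parse_dates converts the named column to datetime64 while reading.
sample = pd.read_csv(sample_csv, parse_dates=["disbursemet_date"])

print(sample.dtypes)
print(sample["disbursemet_date"].dt.month.tolist())
```

Parsing the date up front makes it straightforward to derive features such as the disbursement month later on.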

## Data Preprocessing

In [4]:
#Displaying the first five rows of the data
df.