# Data Science Competition: Predicting Probability of Default

### Problem Statement

Financial institutions face significant risks due to loan defaults. Accurately predicting the 
probability of default (PD) on loans is critical for risk management and strategic planning. In this 
competition, participants are tasked with developing a predictive model that estimates the 
probability of default on loans using historical loan data.

### Import Python libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

from datetime import datetime, date
from IPython.display import display

# Preprocessing
from sklearn.preprocessing import LabelEncoder, StandardScaler, label_binarize
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

# Models
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

# Evaluation and calibration
from sklearn import metrics
from sklearn.metrics import (
    accuracy_score, auc, classification_report, confusion_matrix,
    roc_auc_score, roc_curve,
)
from sklearn.calibration import calibration_curve, CalibratedClassifierCV

# Show every column when printing DataFrames
pd.options.display.max_columns = None

### Importing the required dataset

In [2]:
# read_csv already returns a DataFrame, so no further conversion is needed
data = pd.read_csv('C:/Users/ntumbare/Desktop/Data_Science/data_science_competition_2024.csv')

### View the first lines of the dataset

In [3]:
data.head()

Unnamed: 0.1,Unnamed: 0,loan_id,gender,disbursemet_date,currency,country,sex,is_employed,job,location,loan_amount,number_of_defaults,outstanding_balance,interest_rate,age,number_of_defaults.1,remaining term,salary,marital_status,age.1,Loan Status
0,0,8d05de78-ff32-46b1-aeb5-b3190f9c158a,female,2022 10 29,USD,Zimbabwe,female,True,Teacher,Beitbridge,39000,0,48653.01147,0.22,37,0,47,3230.038869,married,37,Did not default
1,1,368bf756-fcf2-4822-9612-f445d90b485b,other,2020 06 06,USD,Zimbabwe,other,True,Teacher,Harare,27000,2,28752.06224,0.2,43,2,62,3194.139103,single,43,Did not default
2,2,6e3be39e-49b5-45b5-aab6-c6556de53c6f,other,2023 09 29,USD,Zimbabwe,other,True,Nurse,Gweru,35000,1,44797.55413,0.22,43,1,57,3330.826656,married,43,Did not default
3,3,191c62f8-2211-49fe-ba91-43556b307871,female,2022 06 22,USD,Zimbabwe,female,True,Doctor,Rusape,24000,0,35681.49641,0.23,47,0,42,2246.79702,divorced,47,Did not default
4,4,477cd8a1-3b01-4623-9318-8cd6122a8346,male,2023 02 08,USD,Zimbabwe,male,True,Nurse,Chipinge,19000,0,34156.05588,0.2,42,0,45,2310.858441,married,42,Did not default


### View a list of all the variables in the data
The printed list helps identify repeated variables, e.g. number_of_defaults and age, which also appear as number_of_defaults.1 and age.1. <br />
Before looking at descriptive statistics, the data must be cleaned, i.e. repeated variables and unnecessary variables such as loan_id are dropped. <br />

- Irrelevant to the task: <br />
Unique identifiers, such as a row number or customer ID, usually carry no useful information for the machine learning task at hand. They have no predictive power, and including them can hurt model performance by adding noise.

In [4]:
columns = data.columns

print('List of Variables in Dataset:')
print('-----------------------------')
for column in columns:
    print(f'- {column}')



List of Variables in Dataset:
-----------------------------
- Unnamed: 0
- loan_id
- gender
- disbursemet_date
- currency
- country
- sex
- is_employed
- job
- location
- loan_amount
- number_of_defaults
- outstanding_balance
- interest_rate
- age
- number_of_defaults.1
- remaining term
- salary
- marital_status
- age.1
- Loan Status
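
Repeated variables can also be confirmed programmatically rather than by eye. The following is a minimal sketch on a toy frame with made-up values; the helper name `find_duplicate_columns` is hypothetical and not part of the original notebook. It compares every pair of columns and reports those holding identical values.

```python
import pandas as pd

# Toy frame mimicking the repeated columns seen in the listing above.
sample = pd.DataFrame({
    "gender": ["female", "other", "male"],
    "sex": ["female", "other", "male"],
    "age": [37, 43, 42],
    "age.1": [37, 43, 42],
    "loan_amount": [39000, 27000, 19000],
})


def find_duplicate_columns(df):
    """Return (first, second) pairs of columns that hold identical values."""
    pairs = []
    cols = list(df.columns)
    for i, first in enumerate(cols):
        for second in cols[i + 1:]:
            # Series.equals compares values and dtype, ignoring the column name.
            if df[first].equals(df[second]):
                pairs.append((first, second))
    return pairs


duplicate_pairs = find_duplicate_columns(sample)
print(duplicate_pairs)
```

On the full dataset the pairwise comparison is cheap, since there are only 21 columns.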


### Show the number of rows and columns in the data

In [5]:
data.shape

(100000, 21)

### Look at the data types of the variables
This helps identify missing data: the dataset has 100,000 rows, yet some variables have fewer non-null values, i.e. country, job and location.

In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 21 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   Unnamed: 0            100000 non-null  int64  
 1   loan_id               100000 non-null  object 
 2   gender                100000 non-null  object 
 3   disbursemet_date      100000 non-null  object 
 4   currency              100000 non-null  object 
 5   country               99900 non-null   object 
 6   sex                   100000 non-null  object 
 7   is_employed           100000 non-null  bool   
 8   job                   95864 non-null   object 
 9   location              99405 non-null   object 
 10  loan_amount           100000 non-null  int64  
 11  number_of_defaults    100000 non-null  int64  
 12  outstanding_balance   100000 non-null  float64
 13  interest_rate         100000 non-null  float64
 14  age                   100000 non-null  int64  
 15  n
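
A complementary way to read the same information is to count missing values per column directly. This is a small sketch on a toy frame with invented values, not the competition data:

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "country": ["Zimbabwe", np.nan, "Zim", "zimbabwe"],
    "job": ["Teacher", "Nurse", np.nan, np.nan],
    "loan_amount": [39000, 27000, 35000, 24000],
})

# Count missing entries per column and keep only the columns that have any.
missing_counts = sample.isna().sum()
missing_only = missing_counts[missing_counts > 0].sort_values(ascending=False)
print(missing_only)
```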

### Data Cleaning
Cleaning is easier when the data is explored variable by variable, so that none is skipped in the process. <br />
Each variable is checked for the following inconsistencies:
- Repeated variables  <br />
- Missing values  <br />
- Leading/trailing spaces <br />
- Inconsistent labels   <br />
- Unnecessary variables, such as unique identifiers, that should be dropped

### 1. Unnamed: 0
This variable is an unlabelled row counter and carries no useful information, so it is dropped.<br />
Drop the first, unlabelled column:


In [7]:
data = data.drop(data.columns[0], axis=1)
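
Dropping by position depends on the column order at the moment the cell runs, so re-running the cell removes a different column each time. One alternative, sketched here on a toy frame with made-up values, is to drop by label:

```python
import pandas as pd

sample = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "loan_id": ["a", "b", "c"],
    "loan_amount": [39000, 27000, 35000],
})

# Dropping by label is unaffected by column order; errors="ignore" makes the
# call safe to repeat after the column has already been removed.
cleaned = sample.drop(columns=["Unnamed: 0"], errors="ignore")
cleaned_again = cleaned.drop(columns=["Unnamed: 0"], errors="ignore")
print(list(cleaned.columns))
```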

### View the data information after dropping the first variable

In [8]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 20 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   loan_id               100000 non-null  object 
 1   gender                100000 non-null  object 
 2   disbursemet_date      100000 non-null  object 
 3   currency              100000 non-null  object 
 4   country               99900 non-null   object 
 5   sex                   100000 non-null  object 
 6   is_employed           100000 non-null  bool   
 7   job                   95864 non-null   object 
 8   location              99405 non-null   object 
 9   loan_amount           100000 non-null  int64  
 10  number_of_defaults    100000 non-null  int64  
 11  outstanding_balance   100000 non-null  float64
 12  interest_rate         100000 non-null  float64
 13  age                   100000 non-null  int64  
 14  number_of_defaults.1  100000 non-null  int64  
 15  r

### 2. loan_id
The loan_id is excluded when building the training and test sets and when reporting descriptive statistics, as it has no value in predicting default.

### 3. gender

No inconsistencies or missing values were detected in this variable, which has 3 response categories, i.e. female, other and male.

In [9]:
print(data['gender'].unique())

['female' 'other' 'male']


### 4. disbursemet_date
This variable was checked for inconsistencies. <br />
On its own it was judged not useful: the model will never see these historical dates again in future data, so training on them directly adds little. Instead, it was used to create a new variable. <br />

The new variable created was 'loan_length'. <br />

This variable is the duration, in months, for which the loan has existed on the loan portfolio books, i.e. the time elapsed between the date the loan was disbursed and the date on which the PDs were predicted, calculated as:   <br />

   loan_length = PD_prediction_date - disbursement_date  <br />

Months were chosen as the unit because they give a more precise measure of loan length than years, especially for shorter loans. In this case all loans are less than 5 years from disbursement, and some were disbursed in 2023, so they have been in the credit portfolio for only a short time. <br />

The 'PD_prediction_date' was not stated or given, hence an assumption was made in order to create this additional variable. <br />

Assumption: <br />

- Assume that the PDs were predicted as at 30 June 2024  <br />
Hence the formula for 'loan_length' is:    <br />
           loan_length = 30 June 2024 - disbursement_date  <br />

The disbursement date was then dropped.
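
The month difference can be sketched in vectorized form with the pandas `.dt` accessor. This is an illustration on three sample dates, using the assumed reference date of 30 June 2024; the constant name `PD_PREDICTION_DATE` is chosen here for illustration. It counts whole calendar months and ignores the day of the month.

```python
import pandas as pd

# Assumed reference date, following the assumption stated above.
PD_PREDICTION_DATE = pd.Timestamp(2024, 6, 30)

sample = pd.DataFrame({"disbursemet_date": ["2022 10 29", "2020 06 06", "2023 09 29"]})
dates = pd.to_datetime(sample["disbursemet_date"], format="%Y %m %d")

# Whole calendar months between disbursement and the reference date.
sample["loan_length"] = (
    (PD_PREDICTION_DATE.year - dates.dt.year) * 12
    + (PD_PREDICTION_DATE.month - dates.dt.month)
)
print(sample["loan_length"].tolist())  # [20, 48, 9]
```

For example, a loan disbursed on 2022-10-29 gives (2024 - 2022) * 12 + (6 - 10) = 20 months.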


In [10]:

# Convert 'disbursemet_date' to datetime
data['disbursemet_date'] = pd.to_datetime(data['disbursemet_date'], format='%Y %m %d')

# Calculate 'loan_length' in whole calendar months up to the assumed prediction date
current_date = date(2024, 6, 30)
data['loan_length'] = [(current_date.year * 12 + current_date.month) - (d.year * 12 + d.month) for d in data['disbursemet_date']]

print(data)

                                    loan_id  gender disbursemet_date currency  \
0      8d05de78-ff32-46b1-aeb5-b3190f9c158a  female       2022-10-29      USD   
1      368bf756-fcf2-4822-9612-f445d90b485b   other       2020-06-06      USD   
2      6e3be39e-49b5-45b5-aab6-c6556de53c6f   other       2023-09-29      USD   
3      191c62f8-2211-49fe-ba91-43556b307871  female       2022-06-22      USD   
4      477cd8a1-3b01-4623-9318-8cd6122a8346    male       2023-02-08      USD   
...                                     ...     ...              ...      ...   
99995  41000f4b-3821-4dea-90e1-9ecf591ed1c0    male       2021-10-20      USD   
99996  507c2a45-02fa-4aa0-854a-8947a865a7ea   other       2023-06-11      USD   
99997  4f10e845-8f75-4cd5-9f3a-3dad3e04a483  female       2021-10-20      USD   
99998  eded01ca-79d2-4e86-a1e3-2ea1354edca7    male       2021-08-22      USD   
99999  a37561ec-0901-4350-8a13-634f80ece55d   other       2022-04-29      USD   

        country     sex  is

In [11]:
data = data.drop(data.columns[2], axis=1)

In [12]:
print(data)

                                    loan_id  gender currency   country  \
0      8d05de78-ff32-46b1-aeb5-b3190f9c158a  female      USD  Zimbabwe   
1      368bf756-fcf2-4822-9612-f445d90b485b   other      USD  Zimbabwe   
2      6e3be39e-49b5-45b5-aab6-c6556de53c6f   other      USD  Zimbabwe   
3      191c62f8-2211-49fe-ba91-43556b307871  female      USD  Zimbabwe   
4      477cd8a1-3b01-4623-9318-8cd6122a8346    male      USD  Zimbabwe   
...                                     ...     ...      ...       ...   
99995  41000f4b-3821-4dea-90e1-9ecf591ed1c0    male      USD  Zimbabwe   
99996  507c2a45-02fa-4aa0-854a-8947a865a7ea   other      USD  Zimbabwe   
99997  4f10e845-8f75-4cd5-9f3a-3dad3e04a483  female      USD  Zimbabwe   
99998  eded01ca-79d2-4e86-a1e3-2ea1354edca7    male      USD  Zimbabwe   
99999  a37561ec-0901-4350-8a13-634f80ece55d   other      USD  Zimbabwe   

          sex  is_employed           job     location  loan_amount  \
0      female         True       Teacher 

### 5. currency
The currency variable is a constant, so it adds no information to the analysis. <br />
Had the 'currency' column shown more variation, e.g. 'USD', 'EUR', 'GBP', it could have been an important predictor to include in modelling. <br />
Hence the currency variable is dropped.
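
Whether a column is constant can be checked by counting its distinct values. A minimal sketch on a toy frame with invented values:

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "currency": ["USD", "USD", "USD", "USD"],
    "country": ["Zimbabwe", "zimbabwe", "Zim", np.nan],
    "loan_amount": [39000, 27000, 35000, 24000],
})

# A column with a single distinct value (counting NaN as a value) has no variation.
distinct_counts = sample.nunique(dropna=False)
constant_columns = distinct_counts[distinct_counts == 1].index.tolist()
print(constant_columns)  # ['currency']
```

Note that a column whose labels differ only in spelling, like `country` in this toy frame, is not flagged until its labels are standardized.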

In [13]:
data = data.drop(data.columns[2], axis=1)
print(data)

                                    loan_id  gender   country     sex  \
0      8d05de78-ff32-46b1-aeb5-b3190f9c158a  female  Zimbabwe  female   
1      368bf756-fcf2-4822-9612-f445d90b485b   other  Zimbabwe   other   
2      6e3be39e-49b5-45b5-aab6-c6556de53c6f   other  Zimbabwe   other   
3      191c62f8-2211-49fe-ba91-43556b307871  female  Zimbabwe  female   
4      477cd8a1-3b01-4623-9318-8cd6122a8346    male  Zimbabwe    male   
...                                     ...     ...       ...     ...   
99995  41000f4b-3821-4dea-90e1-9ecf591ed1c0    male  Zimbabwe    male   
99996  507c2a45-02fa-4aa0-854a-8947a865a7ea   other  Zimbabwe   other   
99997  4f10e845-8f75-4cd5-9f3a-3dad3e04a483  female  Zimbabwe  female   
99998  eded01ca-79d2-4e86-a1e3-2ea1354edca7    male  Zimbabwe    male   
99999  a37561ec-0901-4350-8a13-634f80ece55d   other  Zimbabwe   other   

       is_employed           job     location  loan_amount  \
0             True       Teacher   Beitbridge        39000   

### 6. country
The country variable adds no information to the analysis: apart from missing values, every entry is a spelling variant of Zimbabwe, so it is effectively a constant. <br />
Had the country column shown more variation, e.g. Zambia, South Africa, it could have been an important predictor to include in modelling.<br />
Hence the country variable is dropped.

In [14]:
#View this country variable

print(pd.unique(data['country']))

['Zimbabwe' 'zimbabwe' 'Zim' nan]


In [15]:
data = data.drop(data.columns[2], axis=1)
print(data)

                                    loan_id  gender     sex  is_employed  \
0      8d05de78-ff32-46b1-aeb5-b3190f9c158a  female  female         True   
1      368bf756-fcf2-4822-9612-f445d90b485b   other   other         True   
2      6e3be39e-49b5-45b5-aab6-c6556de53c6f   other   other         True   
3      191c62f8-2211-49fe-ba91-43556b307871  female  female         True   
4      477cd8a1-3b01-4623-9318-8cd6122a8346    male    male         True   
...                                     ...     ...     ...          ...   
99995  41000f4b-3821-4dea-90e1-9ecf591ed1c0    male    male        False   
99996  507c2a45-02fa-4aa0-854a-8947a865a7ea   other   other         True   
99997  4f10e845-8f75-4cd5-9f3a-3dad3e04a483  female  female         True   
99998  eded01ca-79d2-4e86-a1e3-2ea1354edca7    male    male         True   
99999  a37561ec-0901-4350-8a13-634f80ece55d   other   other         True   

                job     location  loan_amount  number_of_defaults  \
0           Teache

### 7. sex
The sex variable duplicates the gender variable, hence it was dropped.

In [16]:
data = data.drop(data.columns[2], axis=1)
print(data)

                                    loan_id  gender  is_employed  \
0      8d05de78-ff32-46b1-aeb5-b3190f9c158a  female         True   
1      368bf756-fcf2-4822-9612-f445d90b485b   other         True   
2      6e3be39e-49b5-45b5-aab6-c6556de53c6f   other         True   
3      191c62f8-2211-49fe-ba91-43556b307871  female         True   
4      477cd8a1-3b01-4623-9318-8cd6122a8346    male         True   
...                                     ...     ...          ...   
99995  41000f4b-3821-4dea-90e1-9ecf591ed1c0    male        False   
99996  507c2a45-02fa-4aa0-854a-8947a865a7ea   other         True   
99997  4f10e845-8f75-4cd5-9f3a-3dad3e04a483  female         True   
99998  eded01ca-79d2-4e86-a1e3-2ea1354edca7    male         True   
99999  a37561ec-0901-4350-8a13-634f80ece55d   other         True   

                job     location  loan_amount  number_of_defaults  \
0           Teacher   Beitbridge        39000                   0   
1           Teacher       Harare        27000

### 8. is_employed
This variable was checked for inconsistencies such as missing data and labelling errors, and none were found.

In [17]:
print(data['is_employed'].unique())

[ True False]


### 9. job
This variable had more than 4,000 entries with missing data. <br />
Besides missing data, the response labels were inconsistent, e.g. 'Software Developer' vs 'SoftwareDeveloper' and 'Data Scintist' vs 'Data Scientist'. <br />
The missing values were filled using the 'is_employed' variable. <br />
It was found that all the rows with a missing 'job' were not employed according to the 'is_employed' variable, i.e. they were all 'FALSE'. <br />
The 'is_employed' variable therefore determined the replacement on the 'job' variable as follows: <br />
- if 'is_employed' is 'FALSE', then 'job' is set to 'Unemployed'
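
The rule above can be sketched so that the claim is checked before any value is filled. This is an illustration on a toy frame with invented values; the names `missing_job` and `fill_mask` are chosen here and do not appear in the notebook.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "is_employed": [True, False, True, False],
    "job": ["Teacher", np.nan, "Nurse", np.nan],
})

missing_job = sample["job"].isna()

# Check the claim: no row with a missing job is flagged as employed.
all_missing_unemployed = not sample.loc[missing_job, "is_employed"].any()

# Fill only the rows that are both missing a job and not employed.
fill_mask = missing_job & ~sample["is_employed"]
sample.loc[fill_mask, "job"] = "Unemployed"
print(all_missing_unemployed, sample["job"].tolist())
```

Conditioning the fill on `is_employed` means that an employed client with a missing job title, if one existed, would stay missing rather than be labelled 'Unemployed'.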

In [18]:
print(data['job'].unique())

['Teacher' 'Nurse' 'Doctor' 'Data Analyst' 'Software Developer'
 'Accountant' 'Lawyer' 'Engineer' nan 'Data Scientist' 'SoftwareDeveloper'
 'Data Scintist']


In [19]:
# Standardize job titles
data['job'] = data['job'].str.title()
data['job'] = data['job'].replace('Softwaredeveloper', 'Software Developer')
data['job'] = data['job'].replace('Data Scintist', 'Data Scientist')

# Handle missing values
data['job'] = data['job'].fillna('Unemployed')

# Create a list of unique job titles
unique_jobs = data['job'].unique()
print(unique_jobs)

['Teacher' 'Nurse' 'Doctor' 'Data Analyst' 'Software Developer'
 'Accountant' 'Lawyer' 'Engineer' 'Unemployed' 'Data Scientist']


### 10. location
This variable was checked for inconsistencies and some were found. <br />
The response labels were inconsistent: some had leading or trailing spaces, so the same name appeared several times, e.g. 'Gweru', ' Gweru   ', ' Gweru '. <br />

In [20]:
print(data['location'].unique())

['Beitbridge' 'Harare' 'Gweru' 'Rusape' 'Chipinge' 'Chimanimani'
 'Marondera' 'Kadoma' 'Mutare' 'Masvingo' 'Bulawayo' 'Kariba' 'Plumtree'
 'Chiredzi' 'Shurugwi' 'Chivhu' 'Zvishavane' 'Nyanga' 'Karoi' 'Redcliff'
 'Kwekwe' ' Karoi ' 'Gokwe' 'Victoria Falls' ' Masvingo ' '   Chipinge   '
 ' Mutare ' nan '   Mutare ' ' Marondera   ' '   Rusape   ' ' Bulawayo   '
 'Chivhu ' ' Chimanimani   ' 'Plumtree   ' '   Masvingo   ' '   Gweru '
 '   Chivhu   ' 'Mutare   ' ' Kwekwe ' 'Marondera   ' ' Chipinge   '
 '   Mutare   ' '   Karoi   ' ' Beitbridge   ' '   Karoi ' ' Beitbridge '
 ' Mutare   ' '   Bulawayo ' 'Masvingo   ' ' Kadoma   ' ' Plumtree '
 'Marondera ' '   Plumtree ' ' Chipinge ' '   Harare ' 'Harare   '
 ' Nyanga   ' ' Gweru   ' 'Rusape   ' 'Masvingo ' '   Harare   '
 ' Kadoma ' 'Bulawayo   ' ' Kwekwe   ' 'Hwange' ' Harare '
 '   Marondera   ' 'Chipinge   ' '   Marondera ' '   Beitbridge '
 'Karoi   ' 'Chimanimani ' ' Bulawayo ' 'Chivhu   ' 'Kwekwe ' ' Kariba   '
 ' Marondera ' 'Plumtre

### Removing leading/trailing spaces from the response labels
These spaces inflate the number of unique responses in the 'location' variable.

In [21]:
data['location'] = data['location'].str.strip()

### View the unique locations
The location variable also has missing values, shown as nan.

In [22]:
print(pd.unique(data['location']))

['Beitbridge' 'Harare' 'Gweru' 'Rusape' 'Chipinge' 'Chimanimani'
 'Marondera' 'Kadoma' 'Mutare' 'Masvingo' 'Bulawayo' 'Kariba' 'Plumtree'
 'Chiredzi' 'Shurugwi' 'Chivhu' 'Zvishavane' 'Nyanga' 'Karoi' 'Redcliff'
 'Kwekwe' 'Gokwe' 'Victoria Falls' nan 'Hwange']


### Missing data in the 'location' variable
Missing values in the location variable are filled with 'Not_specified' for completeness. <br />
The variable is important because it reflects the level of default by the client's location/city of residence. <br />
The economic conditions and performance of a specific region or city can have a significant impact on the financial health and creditworthiness of the individuals and businesses located there. Factors such as employment rates, income levels, industry composition and economic growth vary across locations and influence default risk. <br />
Cost of Living: <br />
The cost of living in a particular city or region can affect the financial stability of individuals and businesses. <br />
A higher cost of living, e.g. housing, utilities and other expenses, puts more strain on financial resources and can increase the likelihood of default.<br />

In [23]:
data['location'] = data['location'].fillna('Not_specified')

# Display the updated DataFrame
print(data)

                                    loan_id  gender  is_employed  \
0      8d05de78-ff32-46b1-aeb5-b3190f9c158a  female         True   
1      368bf756-fcf2-4822-9612-f445d90b485b   other         True   
2      6e3be39e-49b5-45b5-aab6-c6556de53c6f   other         True   
3      191c62f8-2211-49fe-ba91-43556b307871  female         True   
4      477cd8a1-3b01-4623-9318-8cd6122a8346    male         True   
...                                     ...     ...          ...   
99995  41000f4b-3821-4dea-90e1-9ecf591ed1c0    male        False   
99996  507c2a45-02fa-4aa0-854a-8947a865a7ea   other         True   
99997  4f10e845-8f75-4cd5-9f3a-3dad3e04a483  female         True   
99998  eded01ca-79d2-4e86-a1e3-2ea1354edca7    male         True   
99999  a37561ec-0901-4350-8a13-634f80ece55d   other         True   

                job     location  loan_amount  number_of_defaults  \
0           Teacher   Beitbridge        39000                   0   
1           Teacher       Harare        27000

### Assigning provinces to the 'location' variable
Each location was assigned to its province to reduce dimensionality, improve generalization and make the model more robust to sparse data in the variable. <br />
Aggregating the locations to province level significantly reduces the number of unique categories, making the data more manageable for the model. <br />
Province-level information may capture broader regional factors that influence default risk, such as economic conditions and infrastructure. <br />
Some cities may have very few observations in the dataset, making it hard for the model to learn reliable patterns.<br />
Grouping the data by province helps address this by giving each province enough observations for the model to learn from.

In [25]:
#Define the mapping of locations to provinces
location_to_province = {
    'Beitbridge': 'Matabeleland South',
    'Harare': 'Harare',
    'Gweru': 'Midlands',
    'Rusape': 'Manicaland',
    'Chipinge': 'Manicaland',
    'Chimanimani': 'Manicaland',
    'Marondera': 'Mashonaland East',
    'Kadoma': 'Mashonaland West',
    'Mutare': 'Manicaland',
    'Masvingo': 'Masvingo',
    'Bulawayo': 'Bulawayo',
    'Kariba': 'Mashonaland West',
    'Plumtree': 'Matabeleland South',
    'Chiredzi': 'Masvingo',
    'Shurugwi': 'Midlands',
    'Chivhu': 'Mashonaland East',
    'Zvishavane': 'Midlands',
    'Nyanga': 'Manicaland',
    'Karoi': 'Mashonaland West',
    'Redcliff': 'Midlands',
    'Kwekwe': 'Midlands',
    'Gokwe': 'Midlands',
    'Victoria Falls': 'Matabeleland North',
    'Hwange': 'Matabeleland North'
}

# Map each location to its province in a new 'province' column; unmapped values become 'Not_Specified'
data['province'] = data['location'].map(location_to_province).fillna('Not_Specified')

# Display the updated DataFrame
print(data)

                                    loan_id  gender  is_employed  \
0      8d05de78-ff32-46b1-aeb5-b3190f9c158a  female         True   
1      368bf756-fcf2-4822-9612-f445d90b485b   other         True   
2      6e3be39e-49b5-45b5-aab6-c6556de53c6f   other         True   
3      191c62f8-2211-49fe-ba91-43556b307871  female         True   
4      477cd8a1-3b01-4623-9318-8cd6122a8346    male         True   
...                                     ...     ...          ...   
99995  41000f4b-3821-4dea-90e1-9ecf591ed1c0    male        False   
99996  507c2a45-02fa-4aa0-854a-8947a865a7ea   other         True   
99997  4f10e845-8f75-4cd5-9f3a-3dad3e04a483  female         True   
99998  eded01ca-79d2-4e86-a1e3-2ea1354edca7    male         True   
99999  a37561ec-0901-4350-8a13-634f80ece55d   other         True   

                job     location  loan_amount  number_of_defaults  \
0           Teacher   Beitbridge        39000                   0   
1           Teacher       Harare        27000

In [28]:
print(pd.unique(data['province']))

['Matabeleland South' 'Harare' 'Midlands' 'Manicaland' 'Mashonaland East'
 'Mashonaland West' 'Masvingo' 'Bulawayo' 'Matabeleland North'
 'Not_Specified']
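
Because unmapped locations fall silently into 'Not_Specified', it can help to list which location labels are absent from the mapping. The sketch below uses an abbreviated, illustrative subset of the mapping and a toy frame, not the full dictionary:

```python
import pandas as pd

# Abbreviated subset of the location-to-province mapping, for illustration only.
location_to_province = {"Harare": "Harare", "Gweru": "Midlands", "Mutare": "Manicaland"}

sample = pd.DataFrame({"location": ["Harare", "Gweru", "Not_specified", "Mutare", "Hwange"]})

# Locations that have no entry in the mapping and would default to 'Not_Specified'.
known = sample["location"].isin(list(location_to_province))
unmapped = sorted(sample.loc[~known, "location"].dropna().unique())
print(unmapped)  # ['Hwange', 'Not_specified']
```

An unexpected name in this list, such as a misspelt city, signals that the mapping needs another entry.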


### Drop the 'location' variable
The location variable has been used to create the new 'province' variable, so it is dropped. <br />

In [26]:
data = data.drop(data.columns[4], axis=1)

In [27]:
# Display the updated DataFrame
print(data)

                                    loan_id  gender  is_employed  \
0      8d05de78-ff32-46b1-aeb5-b3190f9c158a  female         True   
1      368bf756-fcf2-4822-9612-f445d90b485b   other         True   
2      6e3be39e-49b5-45b5-aab6-c6556de53c6f   other         True   
3      191c62f8-2211-49fe-ba91-43556b307871  female         True   
4      477cd8a1-3b01-4623-9318-8cd6122a8346    male         True   
...                                     ...     ...          ...   
99995  41000f4b-3821-4dea-90e1-9ecf591ed1c0    male        False   
99996  507c2a45-02fa-4aa0-854a-8947a865a7ea   other         True   
99997  4f10e845-8f75-4cd5-9f3a-3dad3e04a483  female         True   
99998  eded01ca-79d2-4e86-a1e3-2ea1354edca7    male         True   
99999  a37561ec-0901-4350-8a13-634f80ece55d   other         True   

                job  loan_amount  number_of_defaults  outstanding_balance  \
0           Teacher        39000                   0          48653.01147   
1           Teacher        27

In [12]:

# Missing values
# List the float variables to impute

float_variables = ['outstanding_balance', 'interest_rate', 'salary']

# Perform mean imputation for each float variable
for variable in float_variables:
    mean_value = data[variable].mean()
    data[variable] = data[variable].fillna(mean_value)

# This approach calculates the mean of the non-missing values in each float variable
# and replaces the missing values with it.
# It assumes that the values are missing at random and can be reasonably approximated by the
# central tendency of the observed data.
# Mean imputation preserves the mean of the variable, since the imputed values are based on
# the observed data, but it reduces the spread (variance), because every imputed value sits
# at the centre of the distribution.
# This matters for analysis or modeling that depends on the statistical properties of the
# variable, such as its central tendency and spread.
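
The effect of imputation on central tendency and spread can be checked on a small example. This sketch uses invented salary values and compares mean and median filling; it shows that filling with the mean keeps the mean unchanged while the standard deviation shrinks.

```python
import numpy as np
import pandas as pd

salary = pd.Series([2000.0, 3000.0, np.nan, 4000.0, np.nan, 9000.0])

mean_filled = salary.fillna(salary.mean())      # fills with 4500.0
median_filled = salary.fillna(salary.median())  # fills with 3500.0

# The mean is preserved by mean imputation, but the spread is reduced.
print(salary.mean(), mean_filled.mean())
print(salary.std(), mean_filled.std(), median_filled.std())
```

With a skewed variable such as salary, the median is less sensitive to the large value than the mean is, which is one reason to consider it as the fill value.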

In [18]:
# Missing values in the object data types
# Country

# This replacement was done in relation to the location: the locations of the rows with a missing 'country'
# were all found to be in Zimbabwe, hence the missing values were replaced with Zimbabwe

# View the unique locations
print(data['location'].unique())

['Beitbridge' 'Harare' 'Gweru' 'Rusape' 'Chipinge' 'Chimanimani'
 'Marondera' 'Kadoma' 'Mutare' 'Masvingo' 'Bulawayo' 'Kariba' 'Plumtree'
 'Chiredzi' 'Shurugwi' 'Chivhu' 'Zvishavane' 'Nyanga' 'Karoi' 'Redcliff'
 'Kwekwe' ' Karoi ' 'Gokwe' 'Victoria Falls' ' Masvingo ' '   Chipinge   '
 ' Mutare ' nan '   Mutare ' ' Marondera   ' '   Rusape   ' ' Bulawayo   '
 'Chivhu ' ' Chimanimani   ' 'Plumtree   ' '   Masvingo   ' '   Gweru '
 '   Chivhu   ' 'Mutare   ' ' Kwekwe ' 'Marondera   ' ' Chipinge   '
 '   Mutare   ' '   Karoi   ' ' Beitbridge   ' '   Karoi ' ' Beitbridge '
 ' Mutare   ' '   Bulawayo ' 'Masvingo   ' ' Kadoma   ' ' Plumtree '
 'Marondera ' '   Plumtree ' ' Chipinge ' '   Harare ' 'Harare   '
 ' Nyanga   ' ' Gweru   ' 'Rusape   ' 'Masvingo ' '   Harare   '
 ' Kadoma ' 'Bulawayo   ' ' Kwekwe   ' 'Hwange' ' Harare '
 '   Marondera   ' 'Chipinge   ' '   Marondera ' '   Beitbridge '
 'Karoi   ' 'Chimanimani ' ' Bulawayo ' 'Chivhu   ' 'Kwekwe ' ' Kariba   '
 ' Marondera ' 'Plumtre

In [19]:
# Remove leading/trailing spaces from the response labels
# These spaces inflate the number of unique responses in the location variable
data['location'] = data['location'].str.strip()

# View the unique locations
print(pd.unique(data['location']))

# The location variable also has missing values, shown as nan

['Beitbridge' 'Harare' 'Gweru' 'Rusape' 'Chipinge' 'Chimanimani'
 'Marondera' 'Kadoma' 'Mutare' 'Masvingo' 'Bulawayo' 'Kariba' 'Plumtree'
 'Chiredzi' 'Shurugwi' 'Chivhu' 'Zvishavane' 'Nyanga' 'Karoi' 'Redcliff'
 'Kwekwe' 'Gokwe' 'Victoria Falls' nan 'Hwange']


In [20]:
# Replace missing 'country' values with 'Zimbabwe' if the 'location' is in Zimbabwe
zimbabwe_locations = ['Beitbridge', 'Harare', 'Gweru', 'Rusape', 'Chipinge', 'Chimanimani',
 'Marondera', 'Kadoma', 'Mutare', 'Masvingo', 'Bulawayo', 'Kariba', 'Plumtree',
 'Chiredzi', 'Shurugwi', 'Chivhu', 'Zvishavane', 'Nyanga', 'Karoi', 'Redcliff',
 'Kwekwe', 'Gokwe', 'Victoria Falls', 'Hwange']
in_zimbabwe = data['location'].isin(zimbabwe_locations)
# Assign through the same mask so that rows outside it keep their existing 'country' value
data.loc[in_zimbabwe, 'country'] = data.loc[in_zimbabwe, 'country'].fillna('Zimbabwe')
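
When only a subset of rows is filled, how the result is assigned back matters. A minimal sketch on a toy frame with invented values compares two ways of writing the result: assigning the filtered Series to the whole column, and assigning through the same `.loc` mask. Because pandas aligns on the index, the first form turns every row outside the mask into NaN.

```python
import pandas as pd

toy = pd.DataFrame({
    "location": ["Harare", None, "Gweru"],
    "country": [None, "Zim", "Zimbabwe"],
})
mask = toy["location"].isin(["Harare", "Gweru"])

# Whole-column assignment: row 1 is outside the mask and loses its value.
whole_column = toy.copy()
whole_column["country"] = whole_column.loc[mask, "country"].fillna("Zimbabwe")

# Masked assignment: only the rows inside the mask are touched.
masked = toy.copy()
masked.loc[mask, "country"] = masked.loc[mask, "country"].fillna("Zimbabwe")

print(whole_column["country"].isna().tolist())  # [False, True, False]
print(masked["country"].tolist())               # ['Zimbabwe', 'Zim', 'Zimbabwe']
```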

In [21]:
# Missing 'country' values have been filled with the country name, using location
# View the unique country values
print(pd.unique(data['country']))

['Zimbabwe' 'zimbabwe' 'Zim' nan]


In [23]:
# 'nan' is still present in the data, hence the following assumption is made:
# assume that all the clients in the portfolio are from Zimbabwe
# Replace the remaining 'nan' values with Zimbabwe
data['country'] = data['country'].fillna('Zimbabwe')
print(pd.unique(data['country']))

['Zimbabwe' 'zimbabwe' 'Zim']


In [25]:
# Standardize the country values to "Zimbabwe"

data['country'] = data['country'].str.capitalize()
data['country'] = data['country'].replace(['Zimbabwe', 'Zim'], 'Zimbabwe')

print(pd.unique(data['country']))

['Zimbabwe']


In [26]:
#Location
# View the unique locations
print(pd.unique(data['location']))

['Beitbridge' 'Harare' 'Gweru' 'Rusape' 'Chipinge' 'Chimanimani'
 'Marondera' 'Kadoma' 'Mutare' 'Masvingo' 'Bulawayo' 'Kariba' 'Plumtree'
 'Chiredzi' 'Shurugwi' 'Chivhu' 'Zvishavane' 'Nyanga' 'Karoi' 'Redcliff'
 'Kwekwe' 'Gokwe' 'Victoria Falls' nan 'Hwange']


In [27]:
# The location variable has missing values (blanks).
# It is better to drop the 'location' column than to drop the rows affected by missing
# 'location' values, which would reduce the sample size; large amounts of data are needed
# for training and testing a better-performing model.
# Drop the 'location' column

data = data.drop('location', axis=1)  

In [13]:
#View the new data with replaced values
data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 21 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   Unnamed: 0            100000 non-null  int64  
 1   loan_id               100000 non-null  object 
 2   gender                100000 non-null  object 
 3   disbursemet_date      100000 non-null  object 
 4   currency              100000 non-null  object 
 5   country               99900 non-null   object 
 6   sex                   100000 non-null  object 
 7   is_employed           100000 non-null  bool   
 8   job                   95864 non-null   object 
 9   location              99405 non-null   object 
 10  loan_amount           100000 non-null  int64  
 11  number_of_defaults    100000 non-null  int64  
 12  outstanding_balance   100000 non-null  float64
 13  interest_rate         100000 non-null  float64
 14  age                   100000 non-null  int64  
 15  n