# Pre-processing code for both datasets

Adapted from [here](https://www.kaggle.com/code/caleones/ml-for-early-detection-of-heart-disease) and [here](https://www.kaggle.com/code/hamza062/heart-disease-prediction-ml-88-accuracy#1.-Import-Libraries).

Description of the dataset (also sourced from the two referenced notebooks):
This dataset is a reduced version of a larger heart disease prediction dataset.
id: A unique identifier for each patient.
The features are as follows:
1. age: The patient's age in years.
2. sex: The patient's sex (male or female).
3. dataset: The dataset the patient's record came from (e.g., Cleveland). This is not used in our predictive model.
4. cp: Type of chest pain the patient presented with.
5. trestbps: Resting blood pressure. A high resting blood pressure is indicative of hypertension.
6. chol: Serum cholesterol in mg/dl.
7. fbs: Fasting blood sugar > 120 mg/dl. A high fasting blood sugar is typically indicative of diabetes, which correlates with heart issues.
8. restecg: Resting electrocardiographic results.
9. thalch: Maximum heart rate achieved.
10. exang: Exercise-induced angina.
11. oldpeak: ST depression induced by exercise relative to rest.
12. slope: The slope of the peak exercise ST segment.
13. ca: Number of major vessels (0-3) colored by fluoroscopy.
14. thal: A thallium stress test result.

Target: num: The presence of heart disease (the predicted attribute).


This dataset has the following missing values for each column:

1. ca: 611 rows
2. thal: 486 rows
3. slope: 309 rows
4. trestbps: 59 rows
5. restecg: 2 rows
6. thalch: 55 rows
7. exang: 55 rows
8. oldpeak: 62 rows
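
Counts like these can be reproduced with pandas. The sketch below uses a small made-up frame (`toy` and its values are hypothetical, not the real dataset) to show one way of listing the number of missing entries per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real CSV
toy = pd.DataFrame({
    "ca": [0.0, np.nan, np.nan, 1.0],
    "thal": ["normal", None, "fixed defect", "normal"],
    "trestbps": [120.0, 130.0, np.nan, 110.0],
    "age": [63, 41, 57, 50],
})

# Number of missing entries per column, keeping only columns that have gaps,
# sorted so the column with the most gaps comes first
missing_counts = toy.isna().sum()
missing_counts = missing_counts[missing_counts > 0].sort_values(ascending=False)
print(missing_counts)
```

On the real data the same two lines would be applied to the frame returned by `pd.read_csv`.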


In [10]:
import pandas as pd
from sklearn.preprocessing import StandardScaler, LabelEncoder,MinMaxScaler
from sklearn.impute import KNNImputer
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import RandomForestRegressor,RandomForestClassifier
from sklearn.metrics import mean_squared_error, mean_absolute_error,r2_score,accuracy_score
from sklearn.model_selection import train_test_split
from lightgbm import LGBMRegressor, LGBMClassifier


In [2]:
df = pd.read_csv('heart_disease_uci.csv')
df_procesed = df.drop(columns=['id'])

# apply One-Hot Encoding to categorical columns
#df_procesed = pd.get_dummies(df_procesed, columns=categorical_cols, drop_first=True)
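
The commented-out line above refers to a `categorical_cols` list that is not defined in this notebook. As a rough illustration of what one-hot encoding would produce, here is a minimal sketch on a made-up frame (the frame and its category names are assumptions for the example only).

```python
import pandas as pd

# Hypothetical toy frame; the category names are made up for illustration
toy = pd.DataFrame({
    "age": [63, 41, 57],
    "cp": ["typical angina", "asymptomatic", "non-anginal"],
})

# One-hot encode `cp`; drop_first=True drops the first level (in sorted order)
# so that the remaining indicator columns are not redundant
encoded = pd.get_dummies(toy, columns=["cp"], drop_first=True)
print(encoded.columns.tolist())
```

The notebook uses label encoding instead, which keeps one column per feature at the cost of imposing an arbitrary order on the categories.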

In [3]:
df_procesed['chol'] = df_procesed['chol'].replace(0, np.nan)  # a cholesterol of 0 is treated as missing
num = ['trestbps', 'chol', 'thalch', 'oldpeak']  # numerical columns with NaNs
columns_to_encode = ['sex', 'dataset', 'cp', 'thal', 'slope', 'exang', 'restecg', 'fbs', 'ca']
categ_cols = ['fbs', 'restecg', 'exang', 'slope', 'thal', 'ca']  # categorical columns with NaNs
label_encoders = {}

for col in columns_to_encode:
    # remember where the NaNs are before encoding
    nan_ixs = df_procesed.index[df_procesed[col].isna()]

    # use a fresh encoder per column so that each stored encoder keeps its own classes
    le = LabelEncoder()
    df_procesed[col] = le.fit_transform(df_procesed[col])
    label_encoders[col] = le

    # the encoder turns NaN into a class of its own; restore the NaNs for this column
    if len(nan_ixs) > 0:
        df_procesed[col] = df_procesed[col].astype(float)
        df_procesed.loc[nan_ixs, col] = np.nan


df_procesed.sample(5)



Unnamed: 0,age,sex,dataset,cp,trestbps,chol,fbs,restecg,thalch,exang,oldpeak,slope,ca,thal,num
194,68,0,0,2,120.0,211.0,0.0,0.0,115.0,0.0,1.5,1.0,0.0,1.0,0
250,57,1,0,0,110.0,201.0,0.0,1.0,126.0,1.0,1.5,1.0,0.0,0.0,0
636,52,1,2,0,95.0,,,1.0,82.0,1.0,,,,,2
407,49,0,1,2,130.0,207.0,0.0,2.0,135.0,0.0,0.0,,,,0
362,43,0,1,3,100.0,223.0,0.0,1.0,142.0,0.0,0.0,,,,0
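
An alternative way to get integer codes while keeping the gaps is pandas' own categorical type, which marks missing values with the sentinel `-1`. This is a sketch of that swapped-in technique on a made-up column, not the approach used in the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical column with a gap
thal = pd.Series(["normal", np.nan, "fixed defect", "normal"])

# Categorical codes follow the sorted category order; missing values get -1
codes = pd.Series(pd.Categorical(thal).codes, index=thal.index)

# Turn the -1 sentinel back into NaN so that later imputation can find the gaps
encoded = codes.where(codes != -1).astype(float)
print(encoded.tolist())
```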


In [4]:
for col in num:
    # rows where `col` is missing (to be imputed) and rows where it is present (to train on);
    # .copy() avoids pandas' SettingWithCopyWarning when the imputed values are written back
    df_with_missing = df_procesed[df_procesed[col].isna()].copy()
    df_without_missing = df_procesed[df_procesed[col].notna()].copy()

    # use every other column as a feature; LightGBM handles the remaining NaNs natively
    X = df_without_missing.drop([col], axis=1)
    y = df_without_missing[col]

    # hold out 20% of the known values to estimate the imputation quality
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

    # LightGBM imputation model (the variable name and the printed labels
    # still say "Random Forest", but the model is an LGBMRegressor)
    rf_model = LGBMRegressor(
        objective="regression",
        metric="rmse",
        n_estimators=1000,
        bagging_freq=1,
        subsample=0.413103572972995,
        colsample_bytree=0.5816717344110182,
        min_data_in_leaf=20,
        learning_rate=0.004730072022055302,
        num_leaves=364,
        verbose=-1,
        random_state=42,
    )
    rf_model.fit(X_train, y_train)

    # evaluate the model on the held-out rows
    y_preds = rf_model.predict(X_test)
    print("Missing Values", col, ":", str(round((df_procesed[col].isnull().sum() / len(df_procesed)) * 100, 2))+"%")
    print("MAE for Random Forest Imputation: ", mean_absolute_error(y_test, y_preds))
    print("RMSE for Random Forest Imputation: ", np.sqrt(mean_squared_error(y_test, y_preds)))
    print("R2 Score for Random Forest Imputation: ", r2_score(y_test, y_preds))

    # predict the missing values (rounded to the nearest integer) and fill them in
    y_pred = np.round(rf_model.predict(df_with_missing.drop([col], axis=1)))
    df_with_missing[col] = y_pred

    # note: concatenation places the imputed rows first, so the row order changes
    df_procesed = pd.concat([df_with_missing, df_without_missing], axis=0)




Missing Values trestbps : 6.41%
MAE for Random Forest Imputation:  12.987896160462476
RMSE for Random Forest Imputation:  16.68531235286742
R2 Score for Random Forest Imputation:  0.13074643697451305



Missing Values chol : 21.96%
MAE for Random Forest Imputation:  37.67404300158583
RMSE for Random Forest Imputation:  47.5130016143418
R2 Score for Random Forest Imputation:  0.0922755480202152



Missing Values thalch : 5.98%
MAE for Random Forest Imputation:  15.986669522034932
RMSE for Random Forest Imputation:  20.155850131510356
R2 Score for Random Forest Imputation:  0.43154292977164843



Missing Values oldpeak : 6.74%
MAE for Random Forest Imputation:  0.5720316398754088
RMSE for Random Forest Imputation:  0.7996517380790313
R2 Score for Random Forest Imputation:  0.4925453845818921



In [5]:
df_procesed.sample(5)


Unnamed: 0,age,sex,dataset,cp,trestbps,chol,fbs,restecg,thalch,exang,oldpeak,slope,ca,thal,num
287,58,1,0,1,125.0,220.0,0.0,1.0,144.0,0.0,0.4,1.0,,2.0,0
68,59,1,0,0,170.0,326.0,0.0,0.0,140.0,1.0,3.4,0.0,0.0,2.0,2
849,48,1,3,0,133.0,272.0,0.0,2.0,137.0,,0.0,,,,0
508,47,1,1,0,150.0,226.0,0.0,1.0,98.0,1.0,1.5,1.0,0.0,2.0,1
856,71,1,3,2,137.0,221.0,0.0,1.0,120.0,,1.0,,,,3


In [6]:
for col in categ_cols:
    # rows where `col` is missing (to be imputed) and rows where it is present (to train on);
    # .copy() avoids pandas' SettingWithCopyWarning when the imputed values are written back
    df_with_missing = df_procesed[df_procesed[col].isna()].copy()
    df_without_missing = df_procesed[df_procesed[col].notna()].copy()

    # use every other column as a feature; LightGBM handles the remaining NaNs natively
    X = df_without_missing.drop([col], axis=1)
    y = df_without_missing[col]

    # hold out 20% of the known values to estimate the imputation quality
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=60)

    # LightGBM imputation model (named rf_model, but it is an LGBMClassifier)
    rf_model = LGBMClassifier(
        verbose=-1,
        learning_rate=0.023021779601797816,
        num_leaves=149,
        subsample=0.6929884706542179,
        colsample_bytree=0.8635308367372507,
        min_data_in_leaf=47,
        random_state=42,
    )
    rf_model.fit(X_train, y_train)

    # evaluate the model on the held-out rows
    y_preds = rf_model.predict(X_test)
    acc_score = accuracy_score(y_test, y_preds)
    print("The feature '"+ col+ "' has been imputed with", round((acc_score * 100), 2), "accuracy\n")

    # predict the missing values and fill them in
    y_pred = rf_model.predict(df_with_missing.drop([col], axis=1))
    df_with_missing[col] = y_pred

    # note: concatenation places the imputed rows first, so the row order changes
    df_procesed = pd.concat([df_with_missing, df_without_missing], axis=0)

The feature 'fbs' has been imputed with 80.12 accuracy

The feature 'restecg' has been imputed with 65.22 accuracy

The feature 'exang' has been imputed with 79.19 accuracy

The feature 'slope' has been imputed with 69.11 accuracy



The feature 'thal' has been imputed with 72.41 accuracy

The feature 'ca' has been imputed with 66.13 accuracy
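
The `SettingWithCopyWarning` that pandas raises for `df_with_missing[col] = y_pred` comes from assigning into a filtered slice. One way to sidestep it is to write the predictions back through `.loc` with a boolean mask, which also keeps the rows in their original order. The following is a minimal sketch on a made-up frame; `predicted` is a stand-in for the model output.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in `exang`
toy = pd.DataFrame({
    "age": [63, 41, 57, 50],
    "exang": [0.0, np.nan, 1.0, np.nan],
})

missing = toy["exang"].isna()

# Stand-in predictions; in the notebook these would come from the fitted model
predicted = np.array([1.0, 0.0])

# Assigning through .loc on the original frame avoids writing to a slice copy
# and leaves the rows in their original order
toy.loc[missing, "exang"] = predicted
print(toy)
```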




In [7]:
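# keep rows with trestbps > 50 and 30 < chol < 300, then drop duplicate rows
df_procesed = df_procesed[df_procesed['trestbps'] > 50]
df_procesed = df_procesed[(df_procesed['chol'] > 30) & (df_procesed['chol'] < 300)]
df_procesed = df_procesed.drop_duplicates()

# bin the target; note that this is added to the raw frame `df`, not to df_procesed
df['num_bins'] = pd.cut(df['num'], bins=[0, 1, 2, 3, 4], include_lowest=True)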
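
To see what these bin edges do to the target, the sketch below applies the same `pd.cut` call to the five possible values of `num`. Because the bins are right-closed and `include_lowest=True` extends the first bin to include 0, the values 0 and 1 end up in the same bin.

```python
import pandas as pd

# The five possible values of `num`
num = pd.Series([0, 1, 2, 3, 4])

# Same call as in the cell above: right-closed bins, lowest edge included
binned = pd.cut(num, bins=[0, 1, 2, 3, 4], include_lowest=True)
print(binned.value_counts(sort=False))
```

If 0 (no disease) is meant to form its own class, different edges would be needed, for example `[-1, 0, 1, 2, 3, 4]`; that is an assumption about the intent, not something the notebook states.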


In [8]:
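# note: the previous cell already restricted df_procesed to chol > 30,
# so this selection is empty at this point
df_outlier = df_procesed[(df_procesed['chol'] < 30)]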


In [9]:
imputer = KNNImputer(n_neighbors=5)

# apply KNN Imputer to the entire dataset (this also resets the index)
df_procesed = pd.DataFrame(imputer.fit_transform(df_procesed), columns=df_procesed.columns)

# show the first rows of the transformed dataset
df_procesed.head()

# Here I am also creating an outlier dataset to examine causal inference.
# The outlier rows are selected before they are filtered out of df_procesed;
# since the earlier cell already applied these thresholds, df_outlier is expected to be empty.
outlier_mask = (df_procesed['trestbps'] <= 50) | (df_procesed['chol'] <= 30)
df_outlier = df_procesed[outlier_mask]
df_procesed = df_procesed[~outlier_mask]

from sklearn.preprocessing import QuantileTransformer
scaler = QuantileTransformer(output_distribution='normal')
numerical_features = ['trestbps', 'chol', 'thalch', 'oldpeak']
# scaling is currently disabled
#df_procesed[numerical_features] = scaler.fit_transform(df_procesed[numerical_features])
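
For reference, here is a small self-contained sketch of the two steps in this cell, KNN imputation followed by a quantile transform to a normal distribution, on synthetic data. The data are made up, and passing `index=` and `n_quantiles=` are choices made for this example rather than settings taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import QuantileTransformer

# Synthetic numerical data (made up) with a few gaps in `chol`
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "trestbps": rng.normal(130, 15, size=200),
    "chol": rng.lognormal(mean=5.4, sigma=0.2, size=200),
})
toy.loc[[3, 50, 120], "chol"] = np.nan

# Fill each gap with the mean of the 5 nearest rows; keep index and column names
imputer = KNNImputer(n_neighbors=5)
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns, index=toy.index)

# Map each column onto an approximately standard normal distribution.
# n_quantiles is capped at the number of rows to avoid a warning.
scaler = QuantileTransformer(
    output_distribution="normal",
    n_quantiles=min(100, len(filled)),
    random_state=0,
)
scaled = pd.DataFrame(scaler.fit_transform(filled), columns=filled.columns, index=filled.index)
print(scaled.describe().loc[["mean", "std"]])
```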


In [9]:
df_procesed.to_csv('heart_disease_uci_processed.csv',index = False)


## Loan Approval

In [14]:
df = pd.read_csv('loan_data.csv')
columns_to_encode = ['person_gender', 'person_education', 'person_home_ownership', 'loan_intent', 'previous_loan_defaults_on_file']
num_columns = [
    'person_income', 
    'loan_amnt', 
    'loan_int_rate', 
    'loan_percent_income', 
    'cb_person_cred_hist_length', 
    'credit_score'
]
encoder = LabelEncoder()
scaler = StandardScaler()
df_procesed = df.copy()
for column in columns_to_encode:
    df_procesed[column] = encoder.fit_transform(df_procesed[column])

df_procesed.to_csv('loan_data_processed.csv',index = False)
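
The loop above reuses a single `LabelEncoder`, so after it finishes the encoder only remembers the classes of the last column. If the integer codes need to be mapped back to labels later, one option is to keep one encoder per column. The sketch below shows this on a made-up frame; the category values are assumptions for the example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame; the category values are made up for illustration
toy = pd.DataFrame({
    "person_gender": ["female", "male", "female"],
    "loan_intent": ["EDUCATION", "MEDICAL", "VENTURE"],
})

encoders = {}
encoded = toy.copy()
for column in toy.columns:
    # A fresh encoder per column, so its classes_ stay tied to that column
    encoders[column] = LabelEncoder()
    encoded[column] = encoders[column].fit_transform(toy[column])

# The stored encoder recovers the original labels from the integer codes
decoded = encoders["loan_intent"].inverse_transform(encoded["loan_intent"])
print(decoded.tolist())
```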


In [15]:
df_procesed

Unnamed: 0,person_age,person_gender,person_education,person_income,person_emp_exp,person_home_ownership,loan_amnt,loan_intent,loan_int_rate,loan_percent_income,cb_person_cred_hist_length,credit_score,previous_loan_defaults_on_file,loan_status
0,22.0,0,4,71948.0,0,3,35000.0,4,16.02,0.49,3.0,561,0,1
1,21.0,0,3,12282.0,0,2,1000.0,1,11.14,0.08,2.0,504,1,0
2,25.0,0,3,12438.0,3,0,5500.0,3,12.87,0.44,3.0,635,0,1
3,23.0,0,1,79753.0,0,3,35000.0,3,15.23,0.44,2.0,675,0,1
4,24.0,1,4,66135.0,1,3,35000.0,3,14.27,0.53,4.0,586,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
44995,27.0,1,0,47971.0,6,3,15000.0,3,15.66,0.31,3.0,645,0,1
44996,37.0,0,0,65800.0,17,3,9000.0,2,14.07,0.14,11.0,621,0,1
44997,33.0,1,0,56942.0,7,3,2771.0,0,10.02,0.05,10.0,668,0,1
44998,29.0,1,1,33164.0,4,3,12000.0,1,13.23,0.36,6.0,604,0,1
