# Credit Risk Mortgage Loans
The data is provided by [Home Credit](https://www.homecredit.net/about-us.asp), which provides lines of credit (loans) to the unbanked population. The training table has 307,511 rows and 122 columns.

For each SK_ID_CURR in the test set, we predict a probability for the TARGET variable. The final prediction file should contain a header and have the following format:
SK_ID_CURR,TARGET <br/>
100001,0.1 <br/>
100005,0.9 <br/>
100013,0.2 <br/>
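
As a minimal sketch of producing a file in this format, the snippet below builds a two-column table and serializes it with a header row. The IDs and probabilities are placeholders copied from the example above, and the output goes to an in-memory buffer rather than a real file; any file name used in practice would be our own choice.

```python
import io

import numpy as np
import pandas as pd

# Placeholder IDs and probabilities; in the notebook these would come from
# the test set and a fitted model.
ids = [100001, 100005, 100013]
probs = np.array([0.1, 0.9, 0.2])

submission = pd.DataFrame({"SK_ID_CURR": ids, "TARGET": probs})

# Write to an in-memory buffer; passing a file path instead writes to disk.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)  # index=False keeps only the two columns
csv_text = buffer.getvalue()
print(csv_text)
```

Dropping the index matters here, since an extra unnamed column would not match the expected header.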

## Seven Datasets Summary
The original CSV files can be found on [Kaggle](https://www.kaggle.com/c/home-credit-default-risk). Whenever a dataset is used, its first five rows are shown, so the CSV files do not need to be downloaded to follow along. There are seven sources of data for this project, briefly described below:
* Train.csv (application_train.csv): This is the most important dataset, with 307,511 rows of loan application data. There are 106 numeric columns describing each application. The TARGET column is the one to predict: a 1 means the client had difficulty repaying the loan, and a 0 means the loan did not default. Some of the features will need to be encoded numerically to test whether they have high feature importance.
* bureau.csv: Previous credits from other financial institutions.
* bureau_balance.csv: Monthly data about the previous credits in bureau.csv.
* previous_application.csv: Previous loan applications.
* POS_CASH_balance.csv: Monthly data about previous cash loans.
* credit_card_balance.csv: Monthly credit card data for clients with Home Credit.
* installments_payments.csv: Payment history for previous loans.
<br/> <br/>
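
The note above about encoding features numerically can be sketched with one-hot encoding. This is only one possible approach, not necessarily the one used later in the notebook. The toy rows are made up, and the column names are used purely for illustration.

```python
import pandas as pd

# Toy rows; the values are invented and the column names are illustrative.
toy = pd.DataFrame({
    "NAME_CONTRACT_TYPE": ["Cash loans", "Revolving loans", "Cash loans"],
    "CODE_GENDER": ["M", "F", "F"],
    "AMT_CREDIT": [406597.5, 135000.0, 312682.5],
})

# One-hot encode the two categorical columns; the numeric column passes through.
encoded = pd.get_dummies(toy, columns=["NAME_CONTRACT_TYPE", "CODE_GENDER"])
print(encoded.columns.tolist())
```

Each category becomes its own indicator column, so the result can be fed to models that only accept numeric input.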

## View Train.csv Data
The training dataset is the most important dataset, with over three hundred thousand loan applications. Default risk will be predicted at the very end, using the predictive models that score best on the chosen metrics. The first five rows of the train.csv file are shown below.

In [54]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_validate
from statistics import mean
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.datasets import make_classification
from sklearn import ensemble
#sample=r'/kaggle/input/home-credit-default-risk/sample_submission.csv'
#cash=r'/kaggle/input/home-credit-default-risk/POS_CASH_balance.csv'
#info='/kaggle/input/home-credit-default-risk/HomeCredit_columns_description.csv'
#app=r'/kaggle/input/home-credit-default-risk/previous_application.csv'
#cc=r'/kaggle/input/home-credit-default-risk/credit_card_balance.csv'
#install=r'/kaggle/input/home-credit-default-risk/installments_payments.csv'
bureau_balance=r'/kaggle/input/home-credit-default-risk/bureau_balance.csv'
train=r'/kaggle/input/home-credit-default-risk/application_train.csv'
test=r'/kaggle/input/home-credit-default-risk/application_test.csv'
data=pd.read_csv(train) # (307511, 122)
test=pd.read_csv(test)
data.head()

## Examine TARGET column
How many loans were not repaid? In train.csv, a 0 stands for repaid and a 1 stands for payment difficulties. The fraction of loans that defaulted was 0.081 (about 8.1%). The classes are therefore imbalanced, so we must be careful when selecting the metrics used to evaluate models. In addition, we must consider the other data files for feature importance.

In [29]:
temp=data['TARGET'].value_counts()
print(temp)
paid=temp[0]
notPaid=temp[1]
default=round(notPaid/(paid+notPaid),3)
print("Fraction of loans that defaulted: ", default)
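
To illustrate why metric choice matters with this class balance, here is a small sketch on synthetic labels (not the real data) with roughly the same 8% positive rate. A "model" that gives every loan the same score looks good on accuracy but shows no ranking ability under ROC AUC. The variable names are our own.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic labels with roughly the same 8% positive rate as TARGET.
rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.08).astype(int)

# A useless "model" that assigns the same score (and label 0) to every loan.
constant_scores = np.zeros_like(y_true, dtype=float)
constant_labels = np.zeros_like(y_true)

acc = accuracy_score(y_true, constant_labels)
auc = roc_auc_score(y_true, constant_scores)
print(f"accuracy={acc:.3f}, roc_auc={auc:.3f}")
```

The accuracy is high only because most loans are repaid, while the ROC AUC of 0.5 reflects that the constant score cannot separate the two classes.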

## Find missing values
A column with too many missing values will be removed. Since there are 60 numeric columns with missing data, we need to examine the Bureau data to find feature importance and decide which columns are most worth keeping.

In [53]:
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
df = data.select_dtypes(include=numerics) # (307511, 106)

# Search for columns with missing values:
def findNA():
    print("Missing data by column as a fraction:")
    na_share = df.isnull().sum().sort_values(ascending=False) / len(df)
    print(na_share.head(60))  # a bare expression inside a function displays nothing
#findNA()

## Remove columns and Fill Missing Values

In [31]:
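number=20 # maximum number of missing values allowed per row
# Note: isnull().sum(axis=1) counts missing values per row, so this filters rows, not columns
df = df[df.isnull().sum(axis=1) <= number]
df = df.fillna(df.mean()) # fill remaining gaps with each column's mean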
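
The cell above counts missing values per row (`axis=1`). A column-wise variant, closer to the section heading, might look like the sketch below. The toy frame and the 50% threshold are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 1.0],  # 75% missing
    "c": [5.0, np.nan, 7.0, 8.0],        # 25% missing
})

max_missing_share = 0.5  # assumed threshold, for illustration only

# Column filter: keep columns whose share of missing values is small enough.
missing_share = toy.isnull().mean()
kept = toy.loc[:, missing_share <= max_missing_share]

# Fill what is still missing with each column's mean.
filled = kept.fillna(kept.mean())
print(filled)
```

Here column `b` is dropped, while the single gap in `c` is replaced by the mean of its observed values.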

## Heat Map Correlations and Multicollinearity
There is no major multicollinearity. In fact, few variables are strongly correlated at all. The following heatmap keeps only the features whose absolute correlation with TARGET is above 0.05, because so few variables are highly correlated.

In [32]:
def printHeat():
    corr = df.corr()
    #print(corr)
    y='TARGET'
    highly_corr_features = corr.index[abs(corr[y])>0.05]
    plt.figure(figsize=(10,10))
    heat = sns.heatmap(df[highly_corr_features].corr(),annot=True,cmap="RdYlGn")
    top10=corr[y].sort_values(ascending=False).head(10)
    plt.show()  # render the heatmap; print(heat) would only show the Axes object's repr
    print("Top 10 Correlations:\n", top10) # top ten correlations
#printHeat()
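
The correlation filter used in `printHeat` can be tried on a small synthetic frame. In this sketch the data is randomly generated, the column names `related` and `noise` are hypothetical, and TARGET is dropped from its own ranking, since its correlation with itself is always 1.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 2_000
signal = rng.normal(size=n)
toy = pd.DataFrame({
    "TARGET": (signal + rng.normal(size=n) > 1.5).astype(int),
    "related": signal,              # drives TARGET by construction
    "noise": rng.normal(size=n),    # independent of TARGET
})

# Correlation of each feature with TARGET, excluding TARGET itself.
target_corr = toy.corr()["TARGET"].drop("TARGET")

# Keep features whose absolute correlation exceeds the 0.05 cutoff.
selected = target_corr[target_corr.abs() > 0.05].index.tolist()
print(target_corr.abs().sort_values(ascending=False))
```

Using the absolute value keeps strongly negative correlations in the selection as well as positive ones.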

In [52]:
'''
## View Bureau Data
The Bureau data has [1716428 rows x 17 columns]. Four columns had lots of missing data, so they are deleted.
Then we drop the small portion of rows that still have missing values, to get a general analysis of the data.
The goal is to use this additional information from outside the train set to try to find feature importance.
'''

buraeuData=r'/kaggle/input/home-credit-default-risk/bureau.csv'
buraeuDF=pd.read_csv(buraeuData) #[1716428 rows x 17 columns]

numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
buraeuDF = buraeuDF.select_dtypes(include=numerics) #(1716428, 14)
# Six columns have missing values:
bNA=buraeuDF.isnull().sum().sort_values(ascending=False)/len(buraeuDF) 
buraeuDF=buraeuDF.dropna(thresh=0.8*len(buraeuDF), axis=1) #(1716428, 10)
buraeuDF = buraeuDF.dropna() #(1376391, 10)
buraeuDF.head()
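
One way the bureau information could be brought into the training table is to aggregate it to one row per client and join on `SK_ID_CURR`. The sketch below uses tiny made-up frames as stand-ins for the real files. The aggregate names (`bureau_count` and so on) are hypothetical, and the choice of summaries is an assumption rather than the notebook's method.

```python
import pandas as pd

# Toy stand-ins for application_train.csv and bureau.csv (values are made up).
applications = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "TARGET": [0, 1, 0]})
bureau = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "DAYS_CREDIT": [-500, -100, -900],
    "AMT_CREDIT_SUM": [1000.0, 3000.0, 500.0],
})

# One row per client: number of previous credits plus simple summaries.
bureau_agg = bureau.groupby("SK_ID_CURR").agg(
    bureau_count=("DAYS_CREDIT", "size"),
    bureau_days_credit_mean=("DAYS_CREDIT", "mean"),
    bureau_credit_sum_total=("AMT_CREDIT_SUM", "sum"),
).reset_index()

# A left join keeps clients with no bureau record; their new columns are NaN.
merged = applications.merge(bureau_agg, on="SK_ID_CURR", how="left")
print(merged)
```

After the join, the aggregated columns can be treated like any other numeric feature, including the missing-value handling shown earlier.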

## Split Data
Split the data set into training data and test data. TARGET will always be y since it is the dependent variable. A 1 is a troubled loan, while a 0 is a loan that is not distressed.

In [None]:
X=df.drop('TARGET', axis=1)
y=df['TARGET'] #dependent variable (the target to predict)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=13)
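
Because only about 8% of loans are troubled, a plain random split can leave the two parts with slightly different default rates. A stratified split is one option. The sketch below uses synthetic arrays, and `stratify` is an addition that the cell above does not use.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and labels with 8% positives (not the real data).
rng = np.random.default_rng(13)
X_toy = rng.normal(size=(1_000, 3))
y_toy = np.zeros(1_000, dtype=int)
y_toy[:80] = 1

# stratify=y_toy keeps the positive rate nearly equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=13, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```

Both printed rates should sit close to 0.08, which makes test metrics easier to compare with training metrics.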

## Gradient Booster and Feature Importance

In [None]:
from sklearn.inspection import permutation_importance
from sklearn.ensemble import GradientBoostingClassifier,GradientBoostingRegressor

params = {
 "n_estimators": 5, "max_depth": 4, "min_samples_split": 5, "learning_rate": 0.01,
}

reg = ensemble.GradientBoostingRegressor(**params)
reg.fit(X_train, y_train)

y_pred = reg.predict(X_test)
gbr_r2 = r2_score(y_test, y_pred).round(4) 
print("Gradient boosting regression r2: ", gbr_r2) 

mse = mean_squared_error(y_test, reg.predict(X_test))
print("The mean squared error (MSE) on test set: {:.4f}".format(mse))

#FEATURE IMPORTANCE:
num=10 # How many features?
cols=X.columns
feature_importance = reg.feature_importances_ # importances for all features
sorted_idx = np.argsort(feature_importance)[-num:] # indices of the top `num` features, ascending
pos = np.arange(sorted_idx.shape[0]) + 0.5
fig = plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.barh(pos, feature_importance[sorted_idx], align="center")
plt.yticks(pos, np.array(cols)[sorted_idx])
plt.title("Feature Importance (MDI)")
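
The cell above imports `GradientBoostingClassifier` and `permutation_importance` without using them. As a sketch of how they might fit together, the snippet below trains a classifier on synthetic, imbalanced data from `make_classification`, reads the class-1 probability from `predict_proba`, and ranks features by permutation importance. The hyperparameters and the data are assumptions, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the loan data (about 8% positives).
X_syn, y_syn = make_classification(
    n_samples=2_000, n_features=8, n_informative=3,
    weights=[0.92], random_state=13,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=13, stratify=y_syn
)

clf = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=13)
clf.fit(X_tr, y_tr)

# Column 1 of predict_proba is the estimated probability of class 1.
proba = clf.predict_proba(X_te)[:, 1]
auc = roc_auc_score(y_te, proba)

# Permutation importance on held-out data, scored by ROC AUC.
perm = permutation_importance(
    clf, X_te, y_te, scoring="roc_auc", n_repeats=5, random_state=13
)
ranked = perm.importances_mean.argsort()[::-1]
print(f"ROC AUC: {auc:.3f}")
print("Feature indices, most to least important:", ranked)
```

Unlike the impurity-based (MDI) importances plotted above, permutation importance is measured on held-out data, so it reflects how much each feature helps on unseen rows.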

## Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression, Ridge, Lasso

lrModel = LinearRegression()
lrModel.fit(X_train, y_train)
#print(lrModel.coef_)
#print(lrModel.intercept_)

#Generate Predictions:
predictions = lrModel.predict(X_test)

# plt.scatter(y_test, predictions)
plt.hist(y_test - predictions)

#Performance measurement:
import sklearn.metrics as metrics
from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix, r2_score
# Both reports below expect class labels, not continuous predictions:
#print(classification_report(y_test, predictions))
#print(confusion_matrix(y_test, predictions))

print("MAE: ", metrics.mean_absolute_error(y_test, predictions))
print("RMSE: ", np.sqrt(metrics.mean_squared_error(y_test, predictions)))

#linear regression returns a continuous score for y, not a probability bounded to [0, 1]:
y_pred_proba = lrModel.predict(X_test)
r2 = r2_score(y_test, y_pred_proba).round(4) 
print("Linear regression r2 score: ", r2)

#CROSS VALIDATE TEST RESULTS:
lr_score = lrModel.score(X_test, y_test).round(4)  # R^2 on the held-out test split
print("R2 on test split: ", lr_score)
lr_cv = cross_validate(lrModel, X, y, cv = 5, scoring= 'r2')
lr_cvMean=lr_cv['test_score'].mean().round(4)
print(lr_cvMean, " linear regression cross validate mean")
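
Since TARGET is binary, logistic regression is a natural linear alternative that outputs probabilities directly. This swaps in a different model from the one fitted above, shown here only as a sketch on synthetic data. The pipeline, the scaler, and `max_iter` are our own choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the loan data (about 8% positives).
X_syn, y_syn = make_classification(
    n_samples=2_000, n_features=8, n_informative=3,
    weights=[0.92], random_state=13,
)

# Scaling the features first helps the solver converge.
logit = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Five-fold cross-validation scored by ROC AUC instead of r2.
cv = cross_validate(logit, X_syn, y_syn, cv=5, scoring="roc_auc")
cv_auc = cv["test_score"].mean()

# Fit on all rows and read the probability of class 1.
logit.fit(X_syn, y_syn)
proba = logit.predict_proba(X_syn)[:, 1]
print(f"Mean cross-validated ROC AUC: {cv_auc:.3f}")
```

The second column of `predict_proba` is bounded between 0 and 1, which matches the probability format required for the prediction file.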

### Resources
1. https://www.kaggle.com/willkoehrsen/start-here-a-gentle-introduction