### Project: Loan Eligibility Prediction

- Use **gender, education, income, credit history, and property area** to build a model that predicts **whether a loan will be approved or denied**
- Identify patterns in **key features**
- Predict **loan amount**
- Identify **the impact of credit history**
- Identify patterns in **demographics**
- Identify patterns **between loan term and loan eligibility**
- Identify **the impact of property area**

### Load Libraries

In [9]:
# Import foundational libraries
import pandas as pd # Data manipulation
import numpy as np # Numerical Operations
import seaborn as sns # Statistical Data Visualization
sns.set_theme(style="darkgrid")
import sklearn
import matplotlib.pyplot as plt # Provides a MATLAB-like plotting interface
import matplotlib.patches as mpatches # Creates shapes

# Import plotly (library) for data visualization
import plotly.graph_objs as go # Creates Plotly graphs
from plotly.subplots import make_subplots # Creates subplots to combine plots into one figure (plotly.tools.make_subplots is deprecated)
from plotly.offline import iplot, init_notebook_mode # Displays plotly in Jupyter
init_notebook_mode(connected = True)
import plotly.express as px # Simplifies creating plotly graphs

# Additional Imports
from sklearn.impute import SimpleImputer # Handles missing values
import warnings # Manages warnings
warnings.filterwarnings("ignore")

# Import statistical analysis library (scipy)
from scipy import stats
from scipy.stats import ttest_ind
from scipy.stats import chi2_contingency
import statsmodels.api as sm

# Import algorithm libraries for data analysis
from sklearn.feature_selection import RFE
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Algorithms for handling imbalanced data
from imblearn.over_sampling import SMOTE, ADASYN

# Version Check
print(pd.__version__)
print(np.__version__)
print(sklearn.__version__)
print(sns.__version__)
print(plt.matplotlib.__version__)
# print(plotly.__version__) # Only module is used, not full library
# print(stats.__version__) # Has no attribute for version
print(sm.__version__)
# print(imblearn.__version__) # Only module is used, not full library

1.4.4
1.21.5
1.0.2
0.11.2
3.5.2
0.13.2
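
The version check above skips plotly, scipy, and imblearn because only their submodules were imported. One possible workaround, sketched here as an assumption rather than as part of the original notebook, is to look up installed versions by distribution name with the standard library's `importlib.metadata`. The helper name `installed_version` is hypothetical, and the distribution names (`plotly`, `scipy`, `imbalanced-learn`) are assumed to be the names the packages were installed under.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string for a distribution, or None if absent.

    `installed_version` is a hypothetical helper, not part of the notebook.
    """
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Distribution names are assumptions; adjust them to match the environment.
for name in ["plotly", "scipy", "imbalanced-learn"]:
    print(f"{name}: {installed_version(name)}")
```

This avoids importing the full package just to read `__version__`, and it reports `None` instead of raising when a package is missing.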


### Prepare Dataset

In [17]:
# Import the csv data
train = pd.read_csv('../data/raw/loan-train.csv')
test = pd.read_csv('../data/raw/loan-test.csv')
data = train.copy()

In [18]:
# Prints for TRAIN dataset
print(f"Total {train.shape} (Rows, Columns) in the Train Dataset")
# .shape is a tuple: [0] is the number of rows and [1] is the number of columns
print(f"Total {train.shape[0]} Rows in the Train Dataset")
print(f"Total {train.shape[1]} Columns in the Train Dataset \n")

# Prints for TEST dataset
print(f"Total {test.shape} (Rows, Col) in TEST Dataset")

Total (614, 13) (Rows, Columns) in the Train Dataset
Total 614 Rows in the Train Dataset
Total 13 Columns in the Train Dataset 

Total (367, 12) (Rows, Col) in TEST Dataset
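
The shapes above show that the train set has one more column than the test set. A quick way to see which column differs is to compare the column indexes directly. The sketch below uses two tiny stand-in frames with hypothetical values, and it assumes the extra column is the `Loan_Status` target, which the output above does not confirm. With the real data, `train` and `test` would replace `train_demo` and `test_demo`.

```python
import pandas as pd

# Stand-in frames with hypothetical values; they only mimic the column layout.
train_demo = pd.DataFrame({
    "Loan_ID": ["LP001002", "LP001003"],
    "Credit_History": [1.0, 1.0],
    "Loan_Status": ["Y", "N"],  # assumed target column, present only in train
})
test_demo = pd.DataFrame({
    "Loan_ID": ["DEMO_ROW"],  # hypothetical ID
    "Credit_History": [1.0],
})

# Index.difference returns the labels present in one index but not the other.
only_in_train = train_demo.columns.difference(test_demo.columns).tolist()
only_in_test = test_demo.columns.difference(train_demo.columns).tolist()

print("Only in train:", only_in_train)
print("Only in test:", only_in_test)
```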


In [19]:
# Data is a copy of TRAIN
data.head() # .head() is from pandas; it views the first few rows

| Unnamed: 0 | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | NaN | 360.0 | 1.0 | Urban | Y |
| 1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
| 2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
| 3 | LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Urban | Y |
| 4 | LP001008 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Urban | Y |
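
The first row in the preview has no `LoanAmount` value, so it is worth counting missing values before modeling. The sketch below rebuilds a few columns of the five previewed rows by hand and counts missing entries per column. The frame name `preview` is hypothetical; with the real data, the same calls would run on `data`. This is only one possible check, not necessarily the approach taken later in the notebook.

```python
import numpy as np
import pandas as pd

# A subset of the five previewed rows, typed in by hand for illustration.
preview = pd.DataFrame({
    "Loan_ID": ["LP001002", "LP001003", "LP001005", "LP001006", "LP001008"],
    "ApplicantIncome": [5849, 4583, 3000, 2583, 6000],
    "LoanAmount": [np.nan, 128.0, 66.0, 120.0, 141.0],
    "Credit_History": [1.0, 1.0, 1.0, 1.0, 1.0],
    "Loan_Status": ["Y", "N", "Y", "Y", "Y"],
})

# Count missing values in each column and keep only columns that have any.
missing_per_column = preview.isna().sum()
print(missing_per_column[missing_per_column > 0])

# List the loans that have at least one missing field.
rows_with_missing = preview[preview.isna().any(axis=1)]
print(rows_with_missing["Loan_ID"].tolist())
```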
