This notebook briefly explores an example dataset containing information about applicants' eligibility for loans of various amounts. The notebook was created in a Databricks workspace. The original dataset is available [here](https://github.com/MainakRepositor/Datasets/blob/master/Loan%20Eligibility/loan-test.csv). For the fully rendered version of this notebook with outputs, see [the corresponding HTML file](https://github.com/Nick-Eagles/data_engineering_practice/blob/master/databricks/notebooks/explore_data.html).

In [0]:
#   Load required libraries
!pip install plotnine
from plotnine import aes, geom_point, geom_smooth, ggplot, theme_bw
from pyspark.sql import functions as F

In [0]:
#   Read in the example data and glimpse the whole thing
#   (there are not many rows in this case)
loan_df = spark.table("workspace.default.loan_eligibility")
loan_df.display()

I hypothesize that loan amount is highly correlated with applicant income, since income should be a major factor in an applicant's ability to pay back a given amount.

In [0]:
#   Plot loan amount against applicant income, colored by education status
(
    ggplot(
        loan_df.toPandas(),
        aes(x = 'ApplicantIncome', y = 'LoanAmount', color = 'Education')
    ) +
    geom_point(alpha = 0.3) +
    geom_smooth(method = 'lm') +
    theme_bw(base_size = 15)
)
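
To put a number on the hypothesis rather than judging it from the plot alone, one option is Pearson's correlation coefficient. The sketch below is a plain-Python illustration on hypothetical values (not rows from the loan dataset); `pearson_r` is a helper name introduced here, and it assumes that pairs with a missing value should simply be skipped. In the notebook itself, `loan_df.stat.corr('ApplicantIncome', 'LoanAmount')` should compute the same statistic in Spark.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of two sequences, skipping pairs with a missing value."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    # Sums of cross-products and squared deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values for illustration only, including one extreme income
# and one missing loan amount
incomes = [2500, 4000, 5200, 6100, 30000]
loan_amounts = [100, None, 150, 120, 160]

r = pearson_r(incomes, loan_amounts)
print(f"Pearson r = {r:.2f}")
```

Pearson's r is sensitive to extreme values, which is one more reason to look at the scatter plot alongside the number.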

Okay, so a few outliers in income make it hard to see whether there's a linear relationship in the bulk of the data. Let's keep only the bottom 95% of incomes and otherwise plot the same data again.

In [0]:
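#   Approximate the 95th percentile of income (relative error of 0.01), then
#   keep only the applicants at or below that cutoff
cutoff = loan_df.approxQuantile('ApplicantIncome', [0.95], 0.01)[0]
loan_df_filtered = loan_df.filter(F.col('ApplicantIncome') <= cutoff)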
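
As a rough illustration of what the cutoff-and-filter step does, here is a plain-Python sketch on hypothetical incomes (not values from the loan dataset). `quantile_cutoff` is a helper name introduced here, and it uses one simple exact definition of a quantile: the smallest value whose rank is at least `prob * n`. Spark's `approxQuantile` differs in that a nonzero `relativeError` lets the rank of the returned value drift by up to that fraction of the row count, while a `relativeError` of 0 requests the exact quantile at a higher cost.

```python
import math

def quantile_cutoff(values, prob):
    """Exact quantile: the smallest value whose rank is at least prob * n."""
    ordered = sorted(values)
    rank = max(1, math.ceil(prob * len(ordered)))
    return ordered[rank - 1]

# Hypothetical incomes for illustration only, with one extreme value
incomes = [
    1800, 2100, 2400, 2600, 2900, 3000, 3200, 3300, 3500, 3700,
    3900, 4100, 4300, 4600, 4800, 5100, 5500, 6000, 7200, 72000,
]

cutoff = quantile_cutoff(incomes, 0.95)

# Same idea as the Spark filter: keep rows at or below the cutoff
kept = [income for income in incomes if income <= cutoff]
print(f"cutoff = {cutoff}, kept {len(kept)} of {len(incomes)} incomes")
```

Because the filter uses `<=`, every row tied with the cutoff value is kept, so the retained share can be slightly more than 95%.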

In [0]:
#   Same plot with income outliers removed
(
    ggplot(
        loan_df_filtered.toPandas(),
        aes(x = 'ApplicantIncome', y = 'LoanAmount', color = 'Education')
    ) +
    geom_point(alpha = 0.3) +
    geom_smooth(method = 'lm') +
    theme_bw(base_size = 15)
)

It turns out that, even with the income outliers removed, the plot shows no strong linear relationship between applicant income and loan amount. We'll stop here and do some additional exploration in the Databricks SQL Editor.