# Numerical variables
This example is based on the examples posted on GitHub for the [Feature Engineering for Machine Learning Course](https://github.com/solegalli/feature-engineering-for-machine-learning).

Most variables in a dataset can be classified into one of two major types: **numerical variables** and
**categorical variables**.

**Numerical variables** can be further classified into:
- **Discrete variables**: whole numbers (counts), e.g., the number of children in a family
- **Continuous variables**: may take any value within a range, e.g., a house price
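
As a rough illustration of the distinction, the sketch below flags a column as discrete when all of its non-missing values are whole numbers. The toy data and the `looks_discrete` helper are assumptions made up for this example. They are not part of the loans dataset or of pandas.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "children": [0, 2, 1, 3, 2, 0],
    "house_price": [250000.0, 310500.5, 189999.99, 420000.0, 275250.75, 305000.0],
})


def looks_discrete(series: pd.Series) -> bool:
    """Hypothetical heuristic: every non-missing value is a whole number."""
    values = series.dropna()
    return bool((values % 1 == 0).all())


kinds = {
    col: ("discrete" if looks_discrete(toy[col]) else "continuous")
    for col in toy.columns
}
print(kinds)
```

This is only a heuristic: a continuous quantity that happens to be recorded in whole units (for example, prices rounded to the dollar) would also be flagged as discrete, so the result should be checked against what the variable actually measures.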

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Let's load the loans dataset.
df = pd.read_csv('./datasets/loan.csv')
print(type(df))
df.head()

# Rename columns
# Variable definitions:
- disbursed_amount: loan amount given to the borrower.
- interest: interest rate.
- income: annual income.
- target: loan status (paid or being repaid = 1, defaulted = 0).

In [None]:
df = df.rename(columns={"disbursed_amount": "loan_amount", "income": "annual_income", "target": "loan_status"})
print(df.columns)

## Continuous variables

In [None]:
# Let's look at the values of the variable loan_amount.
# This is the amount of money requested by the borrower. This variable is continuous.
print(df['loan_amount'].unique())
df.describe()

In [None]:
# Let's make a histogram to get familiar with the
# variable distribution.

# A histogram shows the counts of observations in different intervals (or bins)
# of a continuous variable. The x-axis shows the intervals, while the y-axis
# shows the frequency or count of observations within each interval.
# `bins` is the number of intervals you want to divide the data into.
fig = df.loan_amount.hist(bins=50)

fig.set_title('Requested loan amount')
fig.set_xlabel('Loan amount')
fig.set_ylabel('Number of loans')

The variable's values vary across the entire value range. This is characteristic of continuous variables.
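
To make the binning behind a histogram concrete, here is a minimal sketch that uses `numpy.histogram` on a handful of invented loan amounts (not taken from the loans dataset). It counts the observations per bin and then divides by the total to obtain proportions.

```python
import numpy as np

# Toy loan amounts, invented for illustration only.
amounts = np.array([1000, 1500, 2200, 2500, 2600, 4000, 4800, 5000])

# Split the value range into 4 equal-width bins and count observations per bin.
counts, edges = np.histogram(amounts, bins=4)

# Dividing by the number of observations turns counts into proportions.
proportions = counts / counts.sum()

print("bin edges:  ", edges)
print("counts:     ", counts)
print("proportions:", proportions)
```

The counts correspond to bar heights in a count histogram, while the proportions correspond to bar heights in a histogram scaled so that all bars sum to 1.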

In [None]:
# To create a histogram with a proportion on the y-axis, the code below uses
# the `histplot` function from the seaborn library (imported as `sns`) to plot
# the "loan_amount" variable from the dataframe `df`.
# The `stat` parameter is set to "probability".
# This means that the height of each bar represents the proportion of data
# points falling into that bin, rather than the count of data points.

ax = sns.histplot(data=df, x="loan_amount", bins=30, stat="probability")
ax.set(title='Requested loan amount probability distribution', 
    xlabel='Loan amount')

"""
# Another way to create a histogram with a percentage y-axis is to use the `weights` 
# parameter of the `hist` function 

observations_count = len(df.loan_amount)

fig = df.loan_amount.hist(bins=50, 
    weights= [1/len(df.loan_amount)] * observations_count)

fig.set_title('Requested loan amount probability distribution')
fig.set_xlabel('Loan amount')
fig.set_ylabel('Percentage')
"""

In [None]:
# Let's examine the variable interest rate.
# This variable is also continuous: it can take, in principle,
# any value within the range.
df['interest'].unique()

In [None]:
# Let's make a histogram to get familiar with the
# variable distribution.

fig = df['interest'].hist(bins=30)

fig.set_title('Interest Rate')
fig.set_xlabel('Interest Rate')
fig.set_ylabel('Number of Loans')

The variable's values vary continuously across the entire value range.

In [None]:
# Now, let's explore the income declared by the customers,
# that is, how much they earn yearly.

# This variable is also continuous.

fig = df['annual_income'].hist(bins=100)

# For better visualisation, I display a specific
# range in the x-axis.
fig.set_xlim(0, 400000)

# Title and axis labels.
fig.set_title("Customer's Annual Income")
fig.set_xlabel('Annual Income')
fig.set_ylabel('Number of Customers')

Most salaries fall between USD 30,000 and USD 70,000, and only a few customers earn more.
Because this is a continuous variable, the variable's values vary continuously across the variable range.
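
When a variable is right-skewed like this, one way to choose the displayed x-axis range is to cut at a high quantile rather than at a hand-picked value. The sketch below uses synthetic log-normal incomes (an assumption for illustration, not the loans data) and takes the 99th percentile as the upper limit.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed incomes, generated for illustration only.
rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=11, sigma=0.6, size=5000))

# In a right-skewed variable the mean is pulled above the median.
print("median:", round(income.median()))
print("mean:  ", round(income.mean()))

# Use the 99th percentile as the upper display limit.
upper = income.quantile(0.99)
share_above = (income > upper).mean()
print("upper limit:", round(upper))
print("share of observations hidden:", share_above)
```

The value of `upper` could then be passed to `set_xlim(0, upper)` on the histogram axes, which hides only about 1% of the observations.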

## Discrete variables

In [None]:
# Let's inspect the values of the number_open_accounts variable.
# This variable represents the borrower's total number of credit items (for example, credit cards, car loans, mortgages, etc.). 
# This is a discrete variable, because a borrower can have 1 credit card, but not 3.5 credit cards.

# Remove missing values using dropna then get unique ones
df['number_open_accounts'].dropna().unique()

In [None]:
# Let's make a bar chart to get familiar with the variable distribution.
y = df['number_open_accounts'].value_counts().sort_index()
print(y)
fig = y.plot.bar()

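# For better visualisation, I display only the first bars.
# On a bar plot the x-axis is positional (one slot per unique value),
# so the limits refer to bar positions, not to the values themselves.
fig.set_xlim(-0.5, 30.5)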

# Title and axis labels.
fig.set_title('Number of open accounts')
fig.set_xlabel('Number of open accounts')
fig.set_ylabel('Number of Customers')

Use a bar chart for discrete variables and a histogram for continuous variables.
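
One caveat with bar charts built from `value_counts()` is that values that never occur get no bar, so gaps in the data are not visible. A minimal sketch of one way to handle this, using an invented series of account counts (not the loans data), is to reindex the counts over the full integer range:

```python
import pandas as pd

# Toy account counts with a missing value, invented for illustration only.
accounts = pd.Series([1, 2, 2, 3, 3, 3, 5, None])

# Count how often each distinct value occurs, ordered by value.
counts = accounts.dropna().astype(int).value_counts().sort_index()

# Reindex over the full integer range so that absent values (here, 4)
# appear as zero-height bars instead of being skipped.
full_range = range(counts.index.min(), counts.index.max() + 1)
full = counts.reindex(full_range, fill_value=0)
print(full)
```

Calling `full.plot.bar()` would then draw one bar per integer between the minimum and the maximum, including an empty slot for the value that never appears.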