[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Humboldt-WI/bads/blob/master/tutorial_notebooks/5_data_prep_solutions.ipynb) 

# Tutorial 5 - Pandas library for data preparation

<span style="font-weight: bold; color: red;">This version includes solutions to the exercises. </span>

The notebook revisits our lecture on EDA and data preparation. In this scope, you will further deepen your understanding of **Pandas**, the go-to library for working with tabular data in Python. We will exemplify two core Pandas classes, *data series* and *data frame*. To that end, the demo notebook introduces a real-world data set associated with credit scoring. 

Here is the outline for today:
- The HMEQ data set
- Pandas reloaded ...
- Data preparation
- Exploratory data analysis

Before moving on, let's import some of our standard libraries so that we have them ready when we need them.

In [None]:
import numpy as np 
import seaborn as sns
import matplotlib.pyplot as plt


## The HMEQ data set
Our data set, called the  "Home Equity" or, in brief, HMEQ data set, is provided by www.creditriskanalytics.net. It comprises  information about a set of borrowers, which are categorized along demographic variables and variables concerning their business relationship with the lender. A binary target variable called 'BAD' is  provided and indicates whether a borrower has repaid her/his debt. You can think of the data as a standard use case of binary classification.

You obtain the data, together with other interesting finance data sets, directly from www.creditriskanalytics.net. The website also provides a brief description of the data set. Specifically, the data set consists of 5,960 observations and 13 features including the target variable. The variables are defined as follows:

- BAD: the target variable, 1=default; 0=non-default 
- LOAN: amount of the loan request
- MORTDUE: amount due on an existing mortgage
- VALUE: value of current property
- REASON: DebtCon=debt consolidation; HomeImp=home improvement
- JOB: occupational categories
- YOJ: years at present job
- DEROG: number of major derogatory reports
- DELINQ: number of delinquent credit lines
- CLAGE: age of oldest credit line in months
- NINQ: number of recent credit inquiries
- CLNO: number of credit lines
- DEBTINC: debt-to-income ratio

As you can see, the features aim at describing the financial situation of a borrower. We will keep using the data set for many modeling tasks in this demo notebook and future demo notebooks. So it makes sense to familiarize yourself with the above features. Make sure you understand what type of information they provide and what this information might reveal about the risk of defaulting.  

---

# Foundations of the Pandas Library

## Loading data from the WWW
The `Pandas` library supports various ways to load data from, e.g., your hard disk, a server somewhere in your network, etc. Here, we consider the easiest setting, which is loading data from the web. All we need for this is a URL. The following code loads the data directly from our [BADS repository](https://github.com/Humboldt-WI/bads).

In [None]:
import pandas as pd  # import library

# Load the data directly from GitHub
data_url = 'https://raw.githubusercontent.com/Humboldt-WI/bads/master/data/hmeq.csv'
df = pd.read_csv(data_url)

# To show that everything worked out, we can print the first few rows of the data frame
df.head(10)  # print ten rows

## Eyeballing data 
The Pandas data frame provides a ton of useful functions for data handling. We begin with showcasing some standard functions that one needs every time when working with data. 

In [None]:
# Query some properties of the data
print('Dimensionality of the data is {}'.format(df.shape))  # .shape returns a tuple
print('The data set has {} cases.'.format(df.shape[0]))     # we can also index the elements of that tuple
print('The total number of elements is {}.'.format(df.size))

In [None]:
# Obtain a more technical overview of the data 
df.info()

The above output displays some useful information on how exactly the data is stored. We learn about the data type of each feature (i.e., each *data series* object), missing values, and the total amount of memory that the data consumes.

In [None]:
# Produce summary statistics (to R-programmers: this is equivalent to the famous R function summary())
df.describe()

The previous demos gave us an overview of the data. However, if you compare the output of the `describe()` function to the list of features given on the www.creditriskanalytics.net website (see above), you will notice that we are missing some features. For example, we lack a summary of the feature *REASON*; same with *JOB*. If you think about it, that actually makes sense. The result from the function `info()` showed that these features are stored as data type *object*. They are not stored as numeric variables. Consequently, statistical / mathematical operations like computing a mean or quantile are undefined and cannot be computed for these variables. That said, you can still force the `describe()` function to consider all features in its output by setting its argument `include='all'`. 
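As a minimal sketch of how `describe()` can be made to cover non-numeric features, the following example uses a small, made-up data frame (the name `toy` and its values are assumptions for illustration, not the HMEQ data).

```python
import pandas as pd

# Toy data frame (hypothetical values) with one numeric and one text column
toy = pd.DataFrame({
    "LOAN": [1100, 1300, 1500, 1500],
    "REASON": ["HomeImp", "HomeImp", "DebtCon", "HomeImp"],
})

# Default: only numeric columns are summarized
numeric_summary = toy.describe()

# include="all" adds non-numeric columns; statistics that are undefined
# for a column (e.g., the mean of REASON) are reported as NaN
full_summary = toy.describe(include="all")
print(full_summary)
```

For text columns, the summary reports the number of distinct values (`unique`), the most frequent value (`top`), and its frequency (`freq`).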

## Navigating data
We discussed indexing and slicing in the contexts of Python `lists` and other containers like dictionaries. In `Pandas`, `Numpy`, and other libraries, indexing/slicing are equally important and work in similar ways. Here, we provide a few more demos on common ways to use indexing in `Pandas`. A web search for "pandas data frame indexing" will provide many additional insights if you are interested. Likewise, feel free to skip this part if you already feel comfortable with data frame indexing.

### Basic indexing of rows and columns

In [None]:
# Accessing a single column by name
df['BAD']
# Alternatively, you can access a single column using dot-notation
df.BAD

For the *R* programmers: we can index our data in a way similar to *R*. Note the use of `loc[]`. This is a special type of syntax you need to memorize. Also note that we specify the columns we want to index using a `list`. Hence the inner box bracket.

In [None]:
# R-style indexing of selected rows and columns
df.loc[0:4, ["BAD", "LOAN"]]  # select rows 0 to 4 (.loc includes the end label) and for those rows only the columns BAD and LOAN

In [None]:
# Access columns by a numerical index using .iloc
df.iloc[0:4, 0]
df.iloc[0:4, [0, 3, 5]]
df.iloc[0:4, np.arange(4)]

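A few cautionary notes on indexing rows in Pandas. `.loc[]` selects rows by their index *label*, whereas `.iloc[]` selects rows by their integer *position* (0, 1, 2, ...). In the above output, the index is the left-most column without header. We have not defined a custom row index, so Pandas uses consecutive integer numbers by default, and labels and positions coincide. However, a data frame could also have a custom index, or its rows could have been reordered. In such a case, calls to `.loc[]` need to refer to the custom index labels, while `.iloc[]` keeps counting positions from zero. Note also that a slice like `0:4` includes the end point with `.loc[]` but excludes it with `.iloc[]`. It is good practice to eyeball a data frame and verify the way in which rows are indexed prior to using either of them.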
<br>
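To illustrate the difference between label-based and position-based indexing, here is a small sketch with a made-up data frame that has a custom row index (the name `toy` and its values are assumptions for illustration).

```python
import pandas as pd

# Toy data frame (hypothetical values) with a custom, non-consecutive row index
toy = pd.DataFrame({"LOAN": [1100, 1300, 1500]}, index=[10, 20, 30])

by_position = toy.iloc[0, 0]      # first row, first column, regardless of labels
by_label = toy.loc[10, "LOAN"]    # row with index label 10, column LOAN

# Slices differ too: .iloc excludes the end point, .loc includes it
n_iloc = len(toy.iloc[0:2])       # positions 0 and 1
n_loc = len(toy.loc[10:30])       # labels 10, 20, and 30

print(by_position, by_label, n_iloc, n_loc)
```

Both lookups return the same cell here, but `toy.loc[0, "LOAN"]` would raise a `KeyError` because no row carries the label 0.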

### Other common forms of indexing and subset selection
It is also common practice to select rows based on comparisons of feature values. You can achieve this using `.loc`. Here are a few examples:

In [None]:
df.loc[df.BAD == 1, :]  # Get all observations with target variable BAD = 1. The : means you want to retrieve all columns 

In [None]:
df.loc[df["NINQ"]>12, ["LOAN", "VALUE", "NINQ"]]  # Another example where we select only a subset of the columns
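Conditions can also be combined. The following sketch, based on a made-up data frame (`toy` and its values are assumptions), shows the element-wise operators `&` (and) and `|` (or); each condition needs its own parentheses.

```python
import pandas as pd

# Toy data frame with hypothetical values
toy = pd.DataFrame({
    "BAD": [1, 0, 1, 0],
    "LOAN": [1100, 25000, 30000, 5000],
    "NINQ": [0, 3, 5, 1],
})

# Defaulters with a large loan; keep only two columns
risky_large = toy.loc[(toy["BAD"] == 1) & (toy["LOAN"] > 20000), ["LOAN", "NINQ"]]

# Small loans or many recent credit inquiries; keep all columns
small_or_many_inquiries = toy.loc[(toy["LOAN"] < 2000) | (toy["NINQ"] >= 5), :]

print(risky_large)
print(small_or_many_inquiries)
```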

When working with high-dimensional data sets, you will often perform certain actions only with columns of a specific data type. To that end, you should know the function `select_dtypes`.

In [None]:
df.select_dtypes(float)  # select all columns of type float (integer columns such as BAD and LOAN are not included) 
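Note that `select_dtypes(float)` picks up float columns only. As a small sketch on made-up data (`toy` and its values are assumptions), the arguments `include` and `exclude` give finer control, and the string `"number"` covers integers and floats alike.

```python
import pandas as pd

# Toy data frame with hypothetical values
toy = pd.DataFrame({
    "BAD": [1, 0],                 # integer
    "DEBTINC": [34.8, 29.1],       # float
    "JOB": ["Office", "Sales"],    # text
})

float_cols = toy.select_dtypes(include=float).columns.tolist()
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(float_cols, numeric_cols, non_numeric_cols)
```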

## Manipulating data
We often have to manipulate data. For example, imputing missing values as part of data preparation (see later) will require us to change the data stored in a data frame. `Pandas` supports many ways to manipulate data. Let's introduce a few common options. 

### Using built-in Pandas functions
Many functions that `Pandas` provides result in data changes. One example is the `sort_values` function, which we demonstrate below. By default, functions like `sort_values` do not alter the data in a `DataFrame` directly. Instead, they return a new `DataFrame` in which the data was changed. Here is an example. 


In [None]:
df_sorted = df.sort_values(by="LOAN", ascending=False)  # We can specify the column by which to sort and the order; next to other arguments
df_sorted.head(10)  # Print a preview of the data; like above when introducing the method .head()

Note the row index (left-most column). The index tells us that the order of the rows is different. That was to be expected because we sorted by the feature *LOAN*. In the original data, which we store in the variable `df`, we still have the original row order.

In [None]:
df.head(10)

The point of the demo was to show that, by default, `Pandas` functions will not alter the `DataFrame` directly. Therefore, you will often see code of this form: 
```
new_data_frame = old_data_frame.someFunction()
``` 
Occasionally, you can overwrite this default behavior. Some `Pandas` functions provide an argument `inplace`. Setting this argument to `True` would then alter a `DataFrame` directly.

In [None]:
# df.sort_values(by="LOAN", ascending=False, inplace=True)  # Running this line would change your data frame
df.head(10)

### The apply function
If you have used R, you will know the `apply()` function. It kinda does what the name suggests. It lets you define a function and apply that function to every column (or, with `axis=1`, every row) of a data frame. Combine that with indexing and you obtain a powerful way to selectively alter your data. 
<br>
We provide some demos in the following, where, for simplicity, we consider only the numerical features. 

In [None]:
df_numerical = df.select_dtypes(float) 

Silly example: say you want to square the values of all your features. You can achieve this by calling `.apply()` on a `DataFrame` providing a suitable function as argument. In this - silly - example, we can use the in-built `Numpy` function `square`. However, we could also use a custom function, or define the function directly within the call to `.apply`. The latter is a more advanced Python concept known as *lambda function*. Websearch for it if interested.

In [None]:
# All three examples below are equivalent

# Using apply together with an existing function
df_squared = df_numerical.apply(np.square)  # note that we reference the function. Thus it is np.square and not np.square(). When adding brackets, we call the function. 

# Using apply together with a custom function
def my_square(x):
    return x*x

df_squared = df_numerical.apply(my_square)

# Using apply together with a lambda function
df_squared = df_numerical.apply(lambda x: x * x) # you can define a function directly like here, we have a square function

df_squared.head(10)

So this was apply in action. Writing your own custom function and then feeding it every column of a data frame or a selection thereof - by indexing - lets you perform some powerful operations. We will see more meaningful use cases as we go along (spoiler alert: we use `apply()` for outlier handling below) 
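Two details of `apply()` are worth a small sketch: the argument `axis` controls whether the function receives columns or rows, and additional keyword arguments are passed on to the applied function. The data frame `toy` and the helper `value_range` are hypothetical and serve illustration only.

```python
import pandas as pd

# Toy data frame with hypothetical values
toy = pd.DataFrame({"LOAN": [1000.0, 2000.0], "VALUE": [3000.0, 8000.0]})

def value_range(x, scale=1.0):
    """Return the scaled range (max - min) of a Series."""
    return (x.max() - x.min()) * scale

per_column = toy.apply(value_range)                  # axis=0 (default): one result per column
per_row = toy.apply(value_range, axis=1)             # axis=1: one result per row
per_column_k = toy.apply(value_range, scale=0.001)   # keyword arguments reach value_range

print(per_column, per_row, per_column_k, sep="\n")
```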

# Data preparation
Data preparation is a mega-topic. It will accompany us throughout the whole course. In this part, we focus on some typical issues in our data and demonstrate how to perform standard data prep tasks using `Pandas`. 

### Altering data types
We start with a rather technical bit, data types. Remember the way our data is stored at the moment. 

In [None]:
df.info()

The features *JOB* and *REASON* are stored as data type `object`. This is the most general data type in Pandas. A variable of this type can store pretty much any piece of data, numbers, text, dates, times, ... This generality has a price. First, storing data as data type `object` consumes a lot of memory. Second, we cannot access specific functionality that is available for a specific data type only. The `.cat` accessor for managing category levels is an example. It is available for data of type `category` but not for data of type `object`. 
<br>
In our case, the two features that Pandas stores as objects are actually categorical variables. We can easily verify this using, e.g., `value_counts`.

In [None]:
print(df.REASON.value_counts())  # so REASON is a binary variable

In [None]:
print(df.JOB.value_counts())  # JOB is a categorical variable with many levels

Knowing our two "object features" are categories, we should alter their data type accordingly. To that end, we make use of the function `astype`, which facilitates converting one data type into another. 

In [None]:
# Code categories properly 
df['REASON'] = df['REASON'].astype('category')
df['JOB'] = df['JOB'].astype('category')
df.info()  # verify the conversion was successful

Although it does not really matter for this tiny data set, note that the conversion from object to category has reduced the amount of memory that the data frame consumes. On my machine, we need 524.2 KB after the conversion, whereas we needed more than 600 KB for the original data frame. If you work with millions of observations, the above conversion will result in a significant reduction of memory consumption. If memory consumption is an issue, we could achieve a significant further reduction by reducing the precision of the numerical variables. Downcasting from float64 to float32 is likely ok for predictive modeling. Also, the target variable is stored as an integer but we know that it has only two states. Thus, we can convert the target to a boolean.

In [None]:
# The target variable has only two states so that we can store it as a boolean
df['BAD'] = df['BAD'].astype('bool')

# For simplicity, we also convert LOAN to a float so that all numeric variables are of type float
df['LOAN'] = df['LOAN'].astype(np.float64)

# Last, let's change all numeric variables from float64 to float32 to reduce memory consumption
num_vars = df.select_dtypes(include=np.float64).columns
df[num_vars] = df[num_vars].astype(np.float32)

Invest some time to understand the above codes. Our examples start to combine multiple pieces of functionality. For example, the above demo uses indexing, functions, and function arguments to perform tasks. Keep practicing and you will become familiar with the syntax.
<br>
Finally, let's verify our changes once more.

In [None]:
# Check memory consumption after the conversions
df.info()

In total, our type conversions reduced memory consumption by more than half. You might want to bear this potential in mind when using your computer to process larger data sets. Should you be interested in some more information on memory efficiency, have a look at this post at [TowardsDataScience.com](https://towardsdatascience.com/pandas-save-memory-with-these-simple-tricks-943841f8c32). 
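To compare the memory footprint of individual columns, `memory_usage(deep=True)` is handy because it also counts the memory taken by the text values themselves. The sketch below uses a made-up column (`jobs` is an assumption for illustration); exact byte counts depend on your machine and Pandas version.

```python
import pandas as pd

# Hypothetical column with few distinct text values repeated many times
jobs = pd.Series(["Office", "Sales", "Mgr", "Other"] * 25_000, dtype="object")

bytes_object = jobs.memory_usage(deep=True)
bytes_category = jobs.astype("category").memory_usage(deep=True)

print(f"object:   {bytes_object / 1024:.1f} KB")
print(f"category: {bytes_category / 1024:.1f} KB")
```

The saving comes from storing each distinct level once and representing the rows by small integer codes.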

## Missing values
You might have already noticed that our data contains many missing values. This is common when working with real data. Likewise, handling missing values is a standard task in data preparation. `Pandas` provides the function `.isna()` as entry point to the corresponding functionality and helps with identifying the relevant cases.

*Note*: `Pandas` also supports an equivalent function called `.isnull()`. 

In [None]:
# Boolean mask of same size as the data frame to access missing values via indexing
missing_mask = df.isna()

print(f'Dimension of the mask: {missing_mask.shape}')
print(f'Dimension of the data frame: {df.shape}')

missing_mask


We can now count the number of missing values per row or per column or in total.

In [None]:
# missing values per row
miss_per_row = missing_mask.sum(axis=1)
print('Missing values per row:\n', miss_per_row)

# missing values per column
miss_per_col = missing_mask.sum(axis=0)
print('Missing values per column:\n', miss_per_col )

# count the total number of missing values
n_total_missing = missing_mask.sum().sum()
print(f'Total number of missing values: {n_total_missing}')

It can be useful to visualize the *missingness* in a data set by means of a heatmap. Note how the below example gives you a good intuition of how and where the data set is affected by missing values. 

In [None]:
sns.heatmap(df.isna())  # quick visualization of the missing values in our data set
plt.show()
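Next to the heatmap, the share of missing values per feature is often a useful number. A minimal sketch on made-up data (`toy` and its values are assumptions) exploits that the mean of a boolean mask equals the fraction of `True` entries.

```python
import numpy as np
import pandas as pd

# Toy data frame with hypothetical values and some missing entries
toy = pd.DataFrame({
    "YOJ": [10.5, np.nan, 4.0, np.nan],
    "JOB": ["Other", None, "Office", "Sales"],
    "LOAN": [1100, 1300, 1500, 1500],
})

# The mean of a boolean mask equals the share of True values
share_missing = toy.isna().mean().sort_values(ascending=False)
print(share_missing)
```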

### Categorical features
Let's start with the two categorical features. The heatmap suggests that `REASON` exhibits more missing values than `JOB`. We will treat them differently for the sake of illustration. Now that we start altering our data frame more seriously, it is a good idea to make a copy of the data so that we can easily go back to a previous state.

In [None]:
# copy data
df_orig = df.copy()

#### Adding a new category level
One way to treat missing values in a categorical feature is to introduce a new category level "IsMissing". We will demonstrate this approach for the feature *REASON*. 
<br>One feature of the category data type in Pandas is that category levels are managed. We cannot add levels directly. Thus, before assigning the missing values our new category level *IsMissing*, we first need to introduce this level. We basically tell our data frame that *IsMissing* is another suitable entry for *REASON* next to the levels that already exist in the data frame. 

In [None]:
# Variable REASON: we treat missing values as a new category level.
# First we need to add a new level
df["REASON"] = df["REASON"].cat.add_categories(["IsMissing"])

# Now we can do the replacement; using .loc avoids chained assignment, which may fail to modify df
df.loc[df["REASON"].isna(), "REASON"] = "IsMissing"
df.REASON.head()

In [None]:
df.REASON.isna().sum()  # verify that no more missing values exist

#### Mode replacement
For the feature *JOB*, which is multinomial, we replace missing values with the mode. Please note that this is a crude way to handle missing values. I'm not endorsing it! But you should have at least seen a demo. Here it is. 

In [None]:
# Determine the mode
mode_of_job = df.JOB.mode()
print(mode_of_job)

In [None]:
# replace missing values with the mode; using .loc avoids chained assignment
df.loc[df["JOB"].isna(), "JOB"] = df["JOB"].mode()[0]  # the index [0] is necessary as the result of calling mode() is a Pandas Series
# verify that no more missing values exist
df.JOB.head()

In [None]:
# Verify more seriously that missing value replacement was successful
if not df.REASON.isnull().any() and not df.JOB.isnull().any():
    print('well done!')
else:
    print('ups')

### Numerical features
We have a lot of numerical features. To keep things simple, we simply replace all missing values with the median. Again, this is  a crude approach that should be applied with care; if at all. However, it nicely shows how we can process several columns at once using a loop. 

In [None]:
for col in df.select_dtypes(include='float32').columns:  # loop over all numeric columns
    if df[col].isna().sum() > 0:                         # check if there are any missing values in the current feature
        m = df[col].median(skipna=True)                  # compute the median of that feature
        df[col] = df[col].fillna(m)                      # replace missing values with the median and assign the result back

Should you wonder whether it is necessary to write a loop to perform this rather standard operation, the answer is no. You could achieve the same result more elegantly when combining the `fillna()` method with a call to the method `transform()`. Here is what this would look like:
```python
# Alternative approach to impute missing values with the feature median
cols = df.select_dtypes(include='float32').columns 

df[cols] = df[cols].transform(lambda x: x.fillna(x.median()))
``` 
The function `transform()` applies a function to each column of the DataFrame. The lambda function takes each column, fills the missing values with the median of that column, and returns the transformed column. This way, you avoid looping over each column manually. This version can be considered more elegant, but our first shot, writing a loop, may legitimately be considered more readable.
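Yet another option, sketched here on made-up data (`toy` and its values are assumptions), is to pass a Series of column medians to `fillna()`. Pandas then matches the fill values to the columns by name and leaves all other columns untouched.

```python
import numpy as np
import pandas as pd

# Toy data frame with hypothetical values and some missing entries
toy = pd.DataFrame({
    "YOJ": [1.0, np.nan, 3.0, 8.0],
    "CLNO": [10.0, 20.0, np.nan, 40.0],
    "JOB": ["Other", None, "Office", "Sales"],
})

# median(numeric_only=True) returns one median per numeric column;
# fillna matches these values to the columns by name
medians = toy.median(numeric_only=True)
toy_imputed = toy.fillna(medians)

print(toy_imputed)
```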

In [None]:
# Verify there are no more missing values in the data
n_total_missing = df.isna().sum().sum()
if  n_total_missing == 0:
    print('Well done, no more missing values!')
else:
    print(f'Ups! There are still {n_total_missing} missing values.')


# Summary of useful Pandas functions

Many useful tricks with `Pandas` (here `df` is a pandas DataFrame and `col` is one of its columns):

| Goal | Possible Code |
| --- | --- |
| Get df column (column name must have no spaces) | `df.col` |
| Get df column | `df["col"]` |
| Example condition: boolean mask for rows where `col > 1` | `df["col"] > 1` |
| Use index names to select rows and columns | `df.loc[row_list, col_list]` |
| Use index numbers to select rows and columns | `df.iloc[row_list, col_list]` |
| Get df column based on a condition | `df.loc[condition, ['col2','col3',...]]`|
| Group df by values of `col` | `df.groupby("col")` |
| Perform function on `col2` for each group of `col1` | `df.groupby("col1")["col2"].fun()` |
| Find value counts of each value in `col` | `df.groupby(['col']).size()`| 
| Get column mean and ignore null values | `df["col"].mean(skipna=True)` |
| Get column mode | `df["col"].mode()` |
| Get column median | `df["col"].median()` |
| Get the 95th percentile of `col` | `df["col"].quantile(q=0.95)` |
| Filter `df` with a boolean condition | `df.query(condition)` |
| Create tally of `col2` by values of `col1` | `pd.crosstab(df['col1'], df['col2'])` |
| Pivot rows and columns | `df.pivot(index='col1', columns='col2', values='col3')` | 
| Sort values by `col` and save `df` in this order | `df.sort_values(by='col', inplace=True)` |
| Apply function to each column of `df` | `df.apply(fun)` |
| Save `df` as CSV in working directory | `df.to_csv('./file_name.csv', index=False)` |
| Count the number of times each value occurs | `df['col'].value_counts()` |
| Change column's data type | `df['col'] = df['col'].astype('type')` |
| Create boolean matrix of `df` where `True` indicates null value | `df.isnull()` | 
| Create boolean matrix of `df` where `True` indicates null value | `df.isna()` | 
| Create copy of df | `df_copy = df.copy()` |
| Add new category to categorical variable | `df['col'] = df['col'].cat.add_categories(['New C'])` |
| Replace null values with `"IsMissing"` | `df.loc[df['col'].isnull(), 'col'] = "IsMissing"` |
| Fill missing values with median and save `df` | `df['col'] = df['col'].fillna(median_value)` |
| Calculate time at execution (must import `time` library) | `time.time()` |

# Exercises


## 1. Dependency of loan amount and credit risk
Examine the dependency between the loan amount (i.e., feature `LOAN`) and the default risk. You find  information on the latter in the column `df["BAD"]`. A value of 1 indicates that a borrower is a defaulter (i.e., bad risk). Specifically:
1. Calculate the average of the feature `LOAN`
2. Calculate the average `LOAN` amount separately for bad and good risk using logical indexing. 
3. Interpret the results of your analysis. Is there a dependency between `LOAN` and default risk?
4. Re-calculate the average `LOAN` amount for good and bad risks. This time, make use of the function `groupby`, which exists for data frames.  
5. Extend the previous task by computing the group-wise median for all numerical features in the data frame


**Extension:** a nice extension of subtasks 1 to 3 would be to secure your interpretation with a statistical hypothesis test. Perhaps you know a suitable test. If not, run a web search for, e.g., *“statistical test for difference in means python”*.

In [None]:
df

In [None]:
# Solutions to the exercises
#--------------------------------------------------
# 1. Average loan amount in the data
df["LOAN"].mean()

# 2. Average loan amount among goods and bads separately
ix_allbad = df["BAD"] == True 
avg_bad = df.loc[ix_allbad, "LOAN"].mean()
print("Average loan amount among BADs: ", avg_bad)
print("Average loan amount among GOODs: ", df.loc[~ix_allbad, "LOAN"].mean())



In [None]:
# 4. Average feature values using groupby

# Note that the result of the grouping is a GroupBy object. You can index it like a
# data frame and then call aggregation functions, which are computed per group.
# To solve the task, we first group, then index, and finally calculate the mean 
# and do all of that in only one line 
df.groupby(by="BAD")["LOAN"].mean()
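Should you need several aggregates per group at once, the method `agg()` accepts a list of function names. Here is a minimal sketch on made-up data (`toy` and its values are assumptions for illustration).

```python
import pandas as pd

# Toy data frame with hypothetical values
toy = pd.DataFrame({
    "BAD": [1, 1, 0, 0, 0],
    "LOAN": [1000.0, 3000.0, 2000.0, 4000.0, 6000.0],
})

# Several aggregates per group in one call
summary = toy.groupby("BAD")["LOAN"].agg(["mean", "median", "count"])
print(summary)
```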

In [None]:
# 5. Group-wise median of all numerical features
df.groupby(by="BAD").median(numeric_only=True)  # numeric_only=True skips the categorical features REASON and JOB

## 2. Outliers
The lecture introduced a rule of thumb saying that, for a given feature, a feature value $x$ can be considered an outlier if 
$$x >q_3(X) + 1.5 \cdot IQR(X)$$

where $q_3(X)$ denotes the third quartile of the distribution of feature $X$ and $IQR(X)$ the corresponding inter-quartile range.

1. Use the `Pandas` method `quantile` to compute the third and first quartile of feature `LOAN`.
2. Compute the threshold value that a feature value $x$ must not exceed according to the above equation. Store the result in a variable. 
3. Use logical indexing to identify all upper outliers in the feature `LOAN`.
4. Create a new data frame that has no outliers in the feature `LOAN`. To that end: 
- Reuse your solution to task 3 to identify outliers using indexing
- Change the `LOAN` values for all outlier cases to the threshold you computed in step 2.
5. Write a custom function that implements the functionality you created in task 4. Make the feature to process an argument of your function.
6. Call your custom function for all numerical features in the data frame. The goal is to create a data frame that does not have any upper outlier in any of its numerical features. To demonstrate the capabilities of your function, set the threshold to $3 \cdot IQR(X)$. This way, only extreme outliers will be removed.

In [None]:
# 1. First and third quartile of the LOAN feature
quantiles = df["LOAN"].quantile(q=[0.25, 0.75])

# To extract the actual numbers into easy-to-use variables,
# we can access the underlying array and then use unpacking
q1, q3 = quantiles.values
print(f"The first and third quartile are, respectively {q1} and {q3}")

In [None]:
# 2. Threshold value for upper outlier
tau = q3 + 1.5*(q3-q1)
tau

In [None]:
# 3. Find upper outliers in the LOAN feature
ix_upper_outlier = df["LOAN"]>tau 
df.loc[ix_upper_outlier, "LOAN"]

In [None]:
# 4. Remove upper outliers in feature LOAN
df.loc[ix_upper_outlier, "LOAN"] = tau  # outlier truncation
df.loc[ix_upper_outlier, "LOAN"]  # print results to see the effect

In [None]:
# 5. Custom function for outlier detection and removal 
def outlier_truncation(x, factor=1.5):
    """
    Identifies outlier values based on the inter-quartile range IQR. 
    Values above q3 + factor*IQR or below q1 - factor*IQR are truncated and set to
    the respective bound. Following Tukey's rule, we set factor to 1.5 by default
    
        Parameters:
            x (Pandas Series): A data frame column to scan for outliers
            factor (float): An outlier is a value this many times the IQR above q3/below q1
            
        Returns:
            Adjusted variable in which outliers are truncated
    """
    x_new = x.copy()
    
    # Calculate IQR
    IQR = x.quantile(0.75) - x.quantile(0.25) 
    
    # Define upper/lower bound
    upper = x.quantile(0.75) + factor*IQR
    lower = x.quantile(0.25) - factor*IQR
    
    # Truncation
    x_new[x < lower] = lower.astype(np.float32)  # downcasting to float32 is needed to ensure
    x_new[x > upper] = upper.astype(np.float32)  # compatibility with how we store the data in our data frame 
    
    return x_new
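As a possible alternative to the index-based truncation above, the Pandas method `clip()` bounds a Series from below and above in one call. The following is only a sketch; the helper name `clip_outliers` and the toy values are assumptions for illustration.

```python
import pandas as pd

def clip_outliers(x, factor=1.5):
    """Truncate values outside [q1 - factor*IQR, q3 + factor*IQR] (hypothetical helper)."""
    q1, q3 = x.quantile(0.25), x.quantile(0.75)
    iqr = q3 - q1
    return x.clip(lower=q1 - factor * iqr, upper=q3 + factor * iqr)

# Toy Series with one obvious upper outlier
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
clipped = clip_outliers(toy)
print(clipped.tolist())
```

Here the quartiles are 2 and 4, so the upper bound is 4 + 1.5 * 2 = 7 and the value 100 is set to 7.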


In [None]:

# 6. Application of the function to all numerical features in the data 

# Select numeric variables for outlier treatment. 
ix_numerical = df.select_dtypes(include="float32").columns

# Process every selected column using apply
# Additional keyword arguments (here factor=3) are passed on to the applied function
# (see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html)
df[ix_numerical] = df[ix_numerical].apply(outlier_truncation, axis=0, factor=3)  
df.describe()

## 3. Scaling numerical features
Another common data preparation task is scaling numerical features. The goal is to ensure that all features have the same scale. This is important for many machine learning algorithms. The lecture introduced two common scaling methods: min-max scaling and z-score scaling.
The `sklearn` library provides implementations of both approaches in the classes `MinMaxScaler` and `StandardScaler`, which are part of the module `preprocessing`. Experience their functionality by solving the following exercises.

1. Import the class `MinMaxScaler` and `StandardScaler` from the module `preprocessing` in the library `sklearn`.
2. Familiarize yourself with the functioning of the `StandardScaler` using its documentation and other sources (e.g., web search). 
3. Test the `StandardScaler` by applying it to the numerical feature `LOAN`. Afterwards, the scaled feature should have a mean of 0 and a standard deviation of 1. Write a few lines of code to verify this.
4. The use of the `MinMaxScaler` is similar to the `StandardScaler`. Apply the `MinMaxScaler` to all other numerical features in the data set. More specifically, 
- Create a new data frame that contains only the numerical features.
- Remove the feature `LOAN` from that data frame; as we already scaled it in task 3.
- Apply the `MinMaxScaler` to the new data frame.
- Write a few lines of code to verify that the scaling was successful. To that end, recall what the `MinMaxScaler` does.
- Combine the scaled features with the feature `LOAN` and the categorical features in a new `DataFrame`.
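For reference, both scaling methods can be computed by hand in a few lines. The sketch below uses made-up values (`x` is an assumption for illustration) and may help with verifying the results of the `sklearn` classes.

```python
import pandas as pd

# Hypothetical feature values
x = pd.Series([10.0, 20.0, 30.0, 40.0])

# Min-max scaling: maps the smallest value to 0 and the largest to 1
x_minmax = (x - x.min()) / (x.max() - x.min())

# z-score scaling: subtract the mean, divide by the standard deviation.
# ddof=0 (population standard deviation) matches what StandardScaler uses;
# the Pandas default is ddof=1.
x_z = (x - x.mean()) / x.std(ddof=0)

print(x_minmax.tolist())
print(x_z.tolist())
```

Because of the different defaults for `ddof`, calling `.std()` on a feature scaled with the `StandardScaler` gives a value slightly above 1 rather than exactly 1.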


In [None]:
# 1. Import the class MinMaxScaler and StandardScaler from the module preprocessing in the library sklearn
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# 2. Familiarize yourself with the functioning of the StandardScaler using its documentation and other sources (e.g., web search).
# For example, you could start here: https://scikit-learn.org/1.5/api/sklearn.preprocessing.html

# 3. Test the StandardScaler by applying it to the numerical feature LOAN. Afterwards, the scaled feature should have a mean of 0 and a standard deviation of 1. Write a few lines of code to verify this.
scaler = StandardScaler()
df["LOAN_scaled"] = scaler.fit_transform(df[["LOAN"]])

# Verify the scaling
print(f"Mean of LOAN_scaled: {df['LOAN_scaled'].mean()}")
print(f"Standard deviation of LOAN_scaled: {df['LOAN_scaled'].std()}")


In [None]:
# 4. Apply the MinMaxScaler to all other numerical features in the data set.
# Create a new data frame that contains only the numerical features.
df_numerical = df.select_dtypes(include="float32").copy()

# Remove the feature LOAN from that data frame; as we already scaled it in task 3.
# LOAN_scaled may also be stored as float32 and would then be selected above, so we drop it as well.
df_numerical.drop(columns=["LOAN", "LOAN_scaled"], errors="ignore", inplace=True)

# Apply the MinMaxScaler to the new data frame.
min_max_scaler = MinMaxScaler()
df_numerical_scaled = pd.DataFrame(min_max_scaler.fit_transform(df_numerical), columns=df_numerical.columns, index=df_numerical.index)

# Verify the scaling
print(f"Min values of scaled features:\n{df_numerical_scaled.min()}")
print(f"Max values of scaled features:\n{df_numerical_scaled.max()}")

# Combine the scaled features with the feature LOAN and the categorical features in a new DataFrame.
df_scaled = pd.concat([df[["LOAN_scaled"]], df_numerical_scaled, df.select_dtypes(include=["category", "bool"])], axis=1)
df_scaled.head()

## 4. Discretizing numerical features
Discretizing numerical features is another common data preparation task. The goal is to convert continuous numerical features into discrete bins or categories. This can be useful for certain types of analysis and modeling. The `pandas` library provides the `cut` and `qcut` functions for this purpose.

1. Familiarize yourself with the `cut` and `qcut` functions in the `pandas` library using their documentation and other sources (e.g., web search).
2. Use the `cut` function to discretize the `LOAN` feature into 5 equal-width bins. Assign meaningful labels to each bin (e.g., 'Very Low', 'Low', 'Medium', 'High', 'Very High').
3. Verify the binning by displaying the first few rows of the data frame and checking the `LOAN` feature.
4. Use the `qcut` function to discretize the `MORTDUE` feature into 4 quantile-based bins. Assign meaningful labels to each bin (e.g., 'Q1', 'Q2', 'Q3', 'Q4').
5. Verify the binning by displaying the first few rows of the data frame and checking the `MORTDUE` feature.
6. Create a new data frame that includes the discretized `LOAN` and `MORTDUE` features along with the other original features.
7. Write a custom function that takes a data frame and a list of numerical features as input and returns a new data frame with all specified features discretized into a given number of bins using the `cut` function. Test your function on the numerical features in the data frame.
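To get a feeling for the difference between the two functions, consider the following sketch on made-up, right-skewed values (`x` is an assumption for illustration). `cut` forms bins of equal width, whereas `qcut` forms bins with roughly equal numbers of observations.

```python
import pandas as pd

# Hypothetical, right-skewed values
x = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

equal_width = pd.cut(x, bins=2, labels=["low", "high"])   # bins span equal value ranges
equal_freq = pd.qcut(x, q=2, labels=["low", "high"])      # bins hold (roughly) equal counts

print(equal_width.value_counts())
print(equal_freq.value_counts())
```

With equal-width bins, the single large value ends up alone in the upper bin. With quantile-based bins, both bins receive four observations.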

In [None]:
# 2. Use the cut function to discretize the LOAN feature into 5 equal-width bins. Assign meaningful labels to each bin (e.g., 'Very Low', 'Low', 'Medium', 'High', 'Very High').
loan_bins = pd.cut(df["LOAN"], bins=5, labels=["Very Low", "Low", "Medium", "High", "Very High"])
df["LOAN_bins"] = loan_bins  # We add a new column as opposed to overwriting the existing column

# 3. Verify the binning by displaying the first few rows of the data frame and checking the LOAN feature.
print(df[["LOAN", "LOAN_bins"]].head(10))

In [None]:
# 4. Use the qcut function to discretize the MORTDUE feature into 4 quantile-based bins. Assign meaningful labels to each bin (e.g., "Q1", "Q2", "Q3", "Q4").
mortdue_bins = pd.qcut(df["MORTDUE"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])

# 5. Verify the binning by displaying the first few rows of the data frame and checking the MORTDUE feature.
df["MORTDUE_bins"] = mortdue_bins
print(df[["MORTDUE", "MORTDUE_bins"]].head(10))

In [None]:
# 6. Create a new data frame that includes the discretized LOAN and MORTDUE features along with the other original features.
# df already holds LOAN_bins and MORTDUE_bins from the previous steps, so dropping the original columns suffices
df_discretized = df.drop(columns=["LOAN", "MORTDUE"])
df_discretized  # preview the data

In [None]:

# 7. Write a custom function that takes a data frame and a list of numerical features as input and returns a new data frame with all specified features discretized into a given number of bins using the cut function. Test your function on the numerical features in the data frame.
def discretize_features(df, features, bins=5, labels=None):
    """
    Discretizes the specified columns of a DataFrame into equal-width bins.

    Parameters:
    df (pandas.DataFrame): The DataFrame containing the data to be discretized.
    features (list of str): The list of column names to be discretized.
    bins (int): The number of equal-width bins to use for discretization.
    labels (list of str, optional): Labels for the bins; must contain one label per bin.

    Returns:
    pandas.DataFrame: A copy of the DataFrame with one additional "<feature>_bins" column per specified feature.

    """
    df_discretized = df.copy()
    for feature in features:
        df_discretized[feature + "_bins"] = pd.cut(df_discretized[feature], bins=bins, labels=labels)
    return df_discretized

# Test the function on the numerical features in the data frame
ix_numerical = df.select_dtypes(include="float32").columns
df_discretized_all = discretize_features(df, ix_numerical, bins=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
df_discretized_all.head()