# Ames Housing Step-by-step - Exercise 4

Pieter Overdevest  
2024-02-09

For suggestions/questions regarding this notebook, please contact
[Pieter Overdevest](https://www.linkedin.com/in/pieteroverdevest/)
(pieter@innovatewithdata.nl).

### How to work with this Jupyter Notebook yourself?

- Get a copy of the repository ('repo') [machine-learning-with-python-explainers](https://github.com/EAISI/machine-learning-with-python-explainers) from EAISI's GitHub site. This can be done by either cloning the repo or simply downloading the zip-file. Both options are explained in this YouTube video by [Coderama](https://www.youtube.com/watch?v=EhxPBMQFCaI).

- Copy the folders 'ames_housing_pieter\' and 'utils_pieter\' to your own project folder.

### Import packages

In [None]:
# Load packages and assign to a shorter alias.
import pandas as pd
import numpy as np

# Pieter's utils package.
import utils_pieter as up

## Exercise 4 - Explore the outcome variable (`SalePrice`) and how it correlates to other variables

### Data Understanding (continued)

Explore the outcome variable 'SalePrice', representing the sale price of the homes.

#### a. Conduct descriptive/summary statistics on the Y variable (mean, median, std, range)

Since we use the `SalePrice` data multiple times in this notebook, we assign it (a pandas Series) to an object called `ps_y_reduced`. What does it mean for the shape of the distribution that the mean exceeds the median?

In [None]:
ps_y_reduced = df_reduced['SalePrice']

print("Summary statistics of sale price (Y):")
print(f"Mean:   {round(ps_y_reduced.mean(), 1)}")
print(f"Median: {round(ps_y_reduced.median(), 1)}")
print(f"Std:    {round(ps_y_reduced.std(), 1)}")
print(f"Range:  {ps_y_reduced.min()} till {ps_y_reduced.max()}")

#### b. Plot the distribution of the Y variable. What do we observe?

In this section we use the Vega-Altair library to create visualisations. Vega-Altair is a declarative visualization library for Python, meaning you declare **what** you want to see, instead of providing a step-by-step procedure for detailing how the desired visualisation should be achieved, i.e., imperative programming. The [Getting Started](https://altair-viz.github.io/getting_started/overview.html#overview), [Example Gallery](https://altair-viz.github.io/gallery/index.html#example-gallery), and [Tutorial Exploring Seattle Weather](https://altair-viz.github.io/case_studies/exploring-weather.html) pages on their website are a good place to start. When clicking on an example in the gallery, you will be taken to the source code for that example.

In [None]:
# Import package.
import altair as alt

Typically, data distributions are plotted using histograms. We use a [bar chart](https://altair-viz.github.io/gallery/simple_bar_chart.html) to create a histogram of `SalePrice`. The `count()` function counts the number of rows in each bin. The left panel shows the result with the default bin setting. The middle panel shows the result of setting a fixed bin width of $40,000 and the right panel shows the result of limiting the number of bins to at most 16.

In [None]:
# Left panel: bin=True applies Altair's default bin setting.
alt1 = alt.Chart(data=df_reduced).mark_bar().encode(
    alt.X('SalePrice', bin=True),
    alt.Y('count()', title='Count')
)

alt2 = alt.Chart(data=df_reduced).mark_bar().encode(
    alt.X('SalePrice', bin=alt.Bin(step=40000)),
    alt.Y('count()', title='Count')
)

alt3 = alt.Chart(data=df_reduced).mark_bar().encode(
    alt.X('SalePrice', bin=alt.Bin(maxbins=16)),
    alt.Y('count()', title='Count')
)

alt1 | alt2 | alt3

We observe that the data has a right-skewed distribution. This explains why the mean `SalePrice` is larger than its median. This phenomenon occurs often in nature and society, wherever a variable cannot be negative and has no maximum on the positive side. Yearly income is another example.
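
The effect of a long right tail on the mean can be sketched with a handful of numbers. The values below are made up for illustration (they are not taken from the Ames data); a single expensive house is enough to lift the mean well above the median.

```python
import numpy as np

# Hypothetical right-skewed sample: most values are modest, one is very large.
a_prices = np.array([100_000, 120_000, 130_000, 150_000, 400_000])

n_mean   = a_prices.mean()
n_median = np.median(a_prices)

# The single large value pulls the mean up; the median is unaffected by it.
print(f"Mean:   {n_mean:,.0f}")
print(f"Median: {n_median:,.0f}")
```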

##### Extra - How to obtain an estimation for bin width and bin number?

Bin width and bin number are a means to steer the granularity of the histogram. What is the downside of having too few and of having too many bins?

In the middle panel in the figure above, we assumed a bin width of $40,000. Given the distribution of the observed data we can calculate a bin width using the [Freedman–Diaconis rule](https://en.wikipedia.org/wiki/Freedman%E2%80%93Diaconis_rule):

$$
bin\,width = \frac{2 \cdot IQR(x)} {\sqrt[3]{n}}
$$

where IQR(*x*) is the interquartile range of sample *x*, and *n* is the number of observations in the sample. The number of bins can be derived simply by dividing the range by the bin width.

In [None]:
# The IQR is equal to the difference between the 75th and 25th percentiles.
n_q25, n_q75 = np.percentile(ps_y_reduced, q = [25,75])

n_IQR = n_q75 - n_q25

# The bin width is calculated using the Freedman-Diaconis rule, see above.
n_bin_width = 2 * n_IQR / (len(ps_y_reduced)**(1/3))

# The number of bins easily follows from the range and the bin width.
n_bins = int((max(ps_y_reduced) - min(ps_y_reduced))/n_bin_width)

print(f"Freedman–Diaconis bin width:      {n_bin_width:,.0f}")
print(f"Freedman–Diaconis number of bins: {n_bins}")
print(f"\nThis also means that 50% of the houses were sold between ${n_q25:,.0f} (n_q25) and ${n_q75:,.0f} (n_q75).")
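
As a cross-check, NumPy offers the Freedman–Diaconis rule through `np.histogram_bin_edges()` with `bins='fd'`. The sketch below applies both routes to a synthetic, right-skewed sample (`a_sample` is generated here as an assumption, it is not the Ames data). NumPy rounds the number of bins up, so its count may differ by one from a calculation that truncates with `int()`.

```python
import math
import numpy as np

# Synthetic right-skewed sample (assumption: log-normal, for illustration only).
rng = np.random.default_rng(seed=42)
a_sample = rng.lognormal(mean=12, sigma=0.4, size=1000)

# Manual Freedman-Diaconis bin width: 2 * IQR / n^(1/3).
n_q25, n_q75 = np.percentile(a_sample, q=[25, 75])
n_width = 2 * (n_q75 - n_q25) / len(a_sample) ** (1 / 3)

# Number of bins, rounded up so that the full range is covered.
n_bins_manual = math.ceil((a_sample.max() - a_sample.min()) / n_width)

# NumPy's implementation of the same rule returns the bin edges.
a_edges = np.histogram_bin_edges(a_sample, bins='fd')
n_bins_numpy = len(a_edges) - 1

print(f"Manual bin width:       {n_width:,.0f}")
print(f"Manual number of bins:  {n_bins_manual}")
print(f"NumPy number of bins:   {n_bins_numpy}")
```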

#### c. Investigate how `Gr Liv Area` (numerical) relates to the Y variable. Tip: see Altair's [scatter plot](https://altair-viz.github.io/gallery/scatter_tooltips.html).

In this section, we investigate how the `Gr Liv Area` variable - as an example of a numerical variable - and the `SalePrice` variable (response) relate to each other.

In [None]:
# Set the numerical feature we want to investigate.
c_num = "Gr Liv Area"

##### c1. Scatter plot of `Gr Liv Area` against `SalePrice`

We use a [scatter plot](https://altair-viz.github.io/gallery/scatter_tooltips.html) to plot the sale price against the above-grade (ground) living area for each house that was sold.

In [None]:
# Create scatter plot - Part I
alt.Chart(df_reduced).mark_circle().encode(
    x = c_num,
    y = 'SalePrice'
)

Let's tune the same figure a bit.

In [None]:
# Create scatter plot - Part II
alt.Chart(df_reduced).mark_circle(size=60).encode(
    x       = c_num,
    y       = 'SalePrice',
    color   = 'Neighborhood',
    tooltip = [c_num, 'SalePrice', 'Neighborhood']
).properties(
    height = 400,
    width  = 600
).configure_axis(
    labelFontSize = 15,
    titleFontSize = 15
).configure_mark(
    opacity = 0.5
).interactive()

The size and font settings above would have to be repeated for every chart. To avoid that, we register them once as a custom Altair theme.

In [None]:
# Define the theme by returning the dictionary of configurations.
# https://altair-viz.github.io/user_guide/customization.html
def f_theme_altair():

    return {
        'config': {
            'view': {
                'height': 400,
                'width':  600,
            },
            'axis': {
                'labelFontSize': 15,
                'titleFontSize': 15,
            }
        }
    }

# Register the custom theme under a chosen name.
alt.themes.register('my_theme_altair', f_theme_altair)

# Enable the newly registered theme.
alt.themes.enable('my_theme_altair');

# If you want to restore the default theme, use:
#alt.themes.enable('default');

Now, we can reproduce the figure above with less code:

In [None]:
# Create scatter plot - Part III
alt.Chart(df_reduced).mark_circle(size=60).encode(
    x       = c_num,
    y       = 'SalePrice',
    color   = 'Neighborhood',
    tooltip = [c_num, 'SalePrice', 'Neighborhood']
).configure_mark(
    opacity = 0.5
)#.interactive()

To get a better appreciation for the data density and the general trend, we use a [2D histogram heatmap](https://altair-viz.github.io/gallery/histogram_heatmap.html) to bin the data into a grid and display the count of each bin.

In [None]:
alt.Chart(df_reduced).mark_rect().encode(
    alt.X(c_num).bin(step=100),
    alt.Y('SalePrice').bin(step=25000),
    alt.Color('count()').scale(scheme='greenblue')
)#.interactive()

##### c2. Faceted scatter plot of `Gr Liv Area` against `SalePrice`

Looking at the length of the legend and the broad spectrum of colours in the scatter plots above, it is clear that colouring by neighborhood is not very informative beyond showing a general trend. In the next plot we make use of the `facet()` method to give each neighborhood its own panel. Reducing the opacity (alpha) of the marks helps to understand the distribution of the data and whether there really is a trend or not: with fully opaque marks (alpha equal to 1), ten overlapping data points give the same impression as one data point.

The figure below suggests that houses sold in the more expensive areas - like 'NridgHt' and 'StoneBr' - have a higher sale price per square foot of `Gr Liv Area` than houses in the less expensive areas, like 'Sawyer' and 'SWISU'.

In [None]:
alt.Chart(df_reduced).mark_circle().encode(
    x = c_num,
    y = 'SalePrice',
).properties(
    height = 200,
    width  = 200
).facet(
    'Neighborhood',
    columns = 5
)#.interactive()
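
To put a number on the visual impression of 'sale price per square foot', one could divide `SalePrice` by `Gr Liv Area` and compare the median per neighborhood. The sketch below uses a small made-up data frame (`df_demo`, with hypothetical neighborhoods 'A' and 'B'), not the Ames data; the column name `price_per_sqft` is also an assumption.

```python
import pandas as pd

# Hypothetical data: two neighborhoods with two houses each.
df_demo = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'B', 'B'],
    'Gr Liv Area':  [1000, 2000, 1000, 2000],
    'SalePrice':    [100_000, 200_000, 150_000, 320_000],
})

# Sale price per square foot of living area for each house.
df_demo['price_per_sqft'] = df_demo['SalePrice'] / df_demo['Gr Liv Area']

# Median price per square foot per neighborhood, lowest first.
ps_price_per_sqft = (
    df_demo
    .groupby('Neighborhood')['price_per_sqft']
    .median()
    .sort_values()
)

print(ps_price_per_sqft)
```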

#### d. Investigate how `Neighborhood` (categorical) relates to the Y variable. Tip: see Altair's [histogram](https://altair-viz.github.io/gallery/simple_histogram.html) and [boxplot](https://altair-viz.github.io/gallery/boxplot.html).

In this section, we investigate how the `Neighborhood` variable - as an example of a categorical variable - and the `SalePrice` variable (response) relate to each other.

In [None]:
# Set the categorical feature we want to investigate.
c_cat = "Neighborhood"

##### d1. Number of houses sold per neighborhood

Before we look at the relation between neighborhoods and sale price, let's see how many houses have been sold in each neighborhood. We will plot these numbers in a bar chart, ordering the neighborhoods by the median sale price in the respective neighborhood.

For that we create a list of the 28 neighborhoods ordered by their median sale price. The list suggests that MeadowV has the lowest median value for `SalePrice` and StoneBr has the highest median value for `SalePrice`.

In [None]:
l_cat_ordered = (

    df_reduced
    .groupby([c_cat])
    ['SalePrice']
    .median()
    .sort_values()
    .index
    .tolist()
)

print(l_cat_ordered, "\n")
print(f"{c_cat} with lowest median saleprice:  {l_cat_ordered[0]}")
print(f"{c_cat} with highest median saleprice: {l_cat_ordered[-1]}")

We use a [bar chart](https://altair-viz.github.io/gallery/simple_bar_chart.html) and use the count() function to count the number of rows in each category.

In [None]:
alt.Chart(data=df_reduced).mark_bar().encode(
    x=alt.X(c_cat, sort=l_cat_ordered),
    y=alt.Y('count()', title = 'Count')
)

We observe no clear relation between the median sale price in a neighborhood and the number of houses sold in that neighborhood. In NAmes close to 450 houses were sold, whereas in Landmark and GrnHill hardly any houses were sold.

##### d2. Distribution of Y variable per neighborhood

We use the `facet()` method to plot the distribution of `SalePrice` in each neighborhood.

In [None]:
alt.Chart(df_reduced).mark_bar().encode(
    x=alt.X(
        'SalePrice',
        bin  = alt.Bin(step=50000),
        axis = alt.Axis(format='~s')
    ),
    y=alt.Y(
        'count()',
        title = 'Count'

    )
).properties(
    height = 150,
    width  = 150
).facet(
    c_cat,
    columns = 4
).resolve_scale(
    y = 'independent' 
)

We observe different distributions of sale price among the neighborhoods. In NAmes and Sawyer the distribution is narrower than in other neighborhoods. In StoneBr and NridgHt the distribution is wider than in other neighborhoods.

##### d3. Box plot of Y variable per neighborhood

[Box plots](https://altair-viz.github.io/gallery/boxplot.html) allow for a different - more compact - representation of distributions. As before, the neighborhoods are ordered by `l_cat_ordered`. As expected, we observe that the median value - the horizontal line in the middle of each box - goes up steadily from left to right.

In [None]:
alt.Chart(data=df_reduced).mark_boxplot().encode(
    x = alt.X(c_cat, sort=l_cat_ordered),
    y = alt.Y('SalePrice'),
)
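
The box in a box plot spans the first to the third quartile, with the median as the line in between. The same numbers can be computed per category with pandas. The sketch below uses made-up values (`df_demo` with hypothetical groups 'A' and 'B'), so the figures are illustrative only.

```python
import pandas as pd

# Hypothetical data: two groups with five values each.
df_demo = pd.DataFrame({
    'group': ['A'] * 5 + ['B'] * 5,
    'value': [1, 2, 3, 4, 5, 10, 20, 30, 40, 50],
})

# Quartiles per group; unstack() puts the quantile levels in the columns.
df_quartiles = (
    df_demo
    .groupby('group')['value']
    .quantile([0.25, 0.50, 0.75])
    .unstack()
)

print(df_quartiles)
```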

##### d4. Strip plot of Y variable per Neighborhood

To give another way to visualise the distribution, we can use the [strip plot](https://altair-viz.github.io/gallery/strip_plot.html).

In [None]:
alt.Chart(df_reduced).mark_tick().encode(
    x = alt.X(c_cat, sort=l_cat_ordered),
    y = alt.Y('SalePrice')
)

### Data Preparation (continued)

#### e. Assess the distribution of `SalePrice` in exercise 4b. What did you observe? What does it mean for the performance of the prediction model? Log-transform the outcome variable.

We observed that the `SalePrice` distribution is right-skewed. This means the observations are not distributed symmetrically around the estimated values. As a result, the expensive homes in Q4 'pull' the predictions towards higher values more strongly than the homes in Q1 can pull them back. To address this, we apply a log transformation, which makes the distribution more like a normal (symmetric) distribution.

In [None]:
# We create a log-transformed counterpart of ps_y_reduced.
ps_y_reduced_log = np.log(ps_y_reduced)

# Add ps_y_reduced_log to df_reduced.
df_reduced['SalePrice_log'] = ps_y_reduced_log

# Add 'SalePrice_log' to l_df_num_names.
l_df_num_names.append('SalePrice_log')

# Let's compare the distributions of SalePrice and SalePrice_log.
alt1 = alt.Chart(data=df_reduced).mark_bar().encode(
    alt.X('SalePrice', bin=alt.Bin(step=12000)),
    alt.Y('count()', title='Count')
).properties(
    title='Original SalePrice distribution'
)

alt2 = alt.Chart(data=df_reduced).mark_bar().encode(
    alt.X('SalePrice_log', bin=alt.Bin(step=0.1)),
    alt.Y('count()', title='Count')
).properties(
    title='Log-transformed SalePrice distribution'
)

alt1 | alt2

Indeed, we observe that the log-transformed `SalePrice` has a more symmetric, approximately normal distribution.
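
The visual impression can be backed up with a number: the sample skewness is close to zero for a symmetric distribution and positive for a right-skewed one. The sketch below uses a synthetic log-normal sample as a stand-in for sale prices (an assumption, not the Ames data) and also shows that `np.exp()` reverses the transformation, which is needed to turn predictions on the log scale back into prices.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed 'prices' (assumption: log-normal, for illustration only).
rng = np.random.default_rng(seed=1)
ps_price = pd.Series(rng.lognormal(mean=12, sigma=0.4, size=5000))

# Log-transform, as done for SalePrice above.
ps_price_log = np.log(ps_price)

# Sample skewness before and after the transformation.
n_skew_raw = ps_price.skew()
n_skew_log = ps_price_log.skew()

# The transformation is reversible with the exponential function.
ps_price_back = np.exp(ps_price_log)

print(f"Skewness original:        {n_skew_raw:.2f}")
print(f"Skewness log-transformed: {n_skew_log:.2f}")
```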

#### f. Assess `Gr Liv Area` for all houses in the previous exercise. What do you observe? Remove outliers. What does it mean for the scope of the prediction model?

In exercise 4c, we observed five houses that have an exceptionally large above-grade living area (`Gr Liv Area`), three of which even have a relatively low sale price. We remove all houses with `Gr Liv Area` exceeding 4,000 sq ft, which limits the scope of the model to houses with a `Gr Liv Area` of up to 4,000 sq ft.

We continue the remainder of the analysis with `df_scoped`. It is good practice to assign a new variable name to the new object, since it concerns a significant change. This keeps `df_reduced` available for later reference. Of course, you cannot create a new object for each step or change; you will need to find the right balance between memory use (RAM) and clarity.

In [None]:
# Remove observations with 'Gr Liv Area' exceeding 4000 sq ft. In addition,
# we reset the index to prevent issues later on in case of merging data.
df_scoped = df_reduced.query("`Gr Liv Area` <= 4000").reset_index(drop=True)

# We create scoped versions of the outcome variable (log-transformed and original).
ps_y_scoped_log = df_scoped['SalePrice_log']
ps_y_scoped     = df_scoped['SalePrice']

# And we create scoped versions of the numerical and categorical data.
df_scoped_num = df_scoped.select_dtypes(include='number').reset_index(drop=True)
df_scoped_cat = df_scoped.select_dtypes(include='category').reset_index(drop=True)

print(
    f"We continue the analysis with {df_scoped.shape[0]} observations out "
    f"of the {df_reduced.shape[0]} observations in the original dataset.\n"
    f"The data frame has {df_scoped.shape[1]} columns, of which "
    f"{df_scoped_num.shape[1]} are numerical and "
    f"{df_scoped_cat.shape[1]} are categorical."
)
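
The two ingredients of the filtering step, backticks around a column name containing spaces and resetting the index afterwards, can be seen in isolation on a tiny made-up data frame (`df_demo` is hypothetical). Note that `query()` returns a new object and leaves the original untouched.

```python
import pandas as pd

# Hypothetical data: two of the four houses exceed 4000 sq ft.
df_demo = pd.DataFrame({
    'Gr Liv Area': [1500, 4500, 2200, 5600],
    'SalePrice':   [180_000, 160_000, 250_000, 185_000],
})

# Backticks allow column names with spaces inside query().
df_demo_filtered = df_demo.query("`Gr Liv Area` <= 4000")

# Without a reset, the original row labels (0 and 2) are kept.
print(list(df_demo_filtered.index))

# After the reset, the rows are labelled consecutively again.
df_demo_scoped = df_demo_filtered.reset_index(drop=True)
print(list(df_demo_scoped.index))
```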

In [None]:
alt1 = alt.Chart(df_reduced).mark_circle().encode(
    x = c_num,
    y = 'SalePrice'
).properties(
    title='Original data (df_reduced)'
)

alt2 = alt.Chart(df_scoped).mark_circle().encode(
    x = c_num,
    y = 'SalePrice'
).properties(
    title='Scoped data (df_scoped)'
)

alt1 | alt2

### Data Understanding (continued)

#### g. Draw scatter plots between Y and each of the numerical features. Tip: see Altair's [scatter plot](https://altair-viz.github.io/gallery/scatter_tooltips.html)

Recall that we created a list object `l_df_num_names` containing all numerical variable names in the data. We will use this object to create scatter plots of each numerical variable against the SalePrice. We define `l_df_X_names` as subset of `l_df_num_names` containing all variables except `SalePrice` and `SalePrice_log`. To accomplish that we make use of a list comprehension ([RealPython](https://realpython.com/list-comprehension-python/)), see also the Python Explainer 'List Comprehensions'.

In [None]:
l_df_X_names = [x for x in l_df_num_names if x not in ['SalePrice', 'SalePrice_log']]

print(len(l_df_X_names), l_df_X_names)

We define a function to create a 'chart row' (i.e., a row of charts) using the [repeat() method](https://altair-viz.github.io/user_guide/compound_charts.html#repeated-charts). As input arguments it takes a data frame and a list of variable names present in the data frame. The respective variables are plotted against `SalePrice_log`. To get a better understanding of the position of the majority of the data, it helps to set a low opacity value.

In [None]:
def f_create_chart_row(df_data, l_col):

    return (
        alt.Chart(df_data)
        .mark_point(opacity=0.1)
        .encode(
            x=alt.X(alt.repeat("column"), type='quantitative'),
            y=alt.Y(alt.repeat("row"), type='quantitative')
        )
        .properties(width=200, height=200)
        .repeat(
            column = l_col,
            row    = ['SalePrice_log']            
        )
    )

We create a list of chart rows, where each chart row contains up to `n_col` charts. We create as many chart rows as needed to give each of the 38 charts its place in a chart row. In the example below we set `n_col` to four. This means that we end up with ten chart rows. The first nine chart rows each hold four charts. The tenth chart row holds the remaining two charts.

In [None]:
# Number of charts per chart row.
n_col = 4

# Create list of chart rows (i.e., row of charts) each having n_col charts.
l_chart_row = [
    
    # Create chart row.
    f_create_chart_row(
        df_data = df_scoped,
        l_col   = l_df_X_names[i:i+n_col]
    )
    
    for i in range(0, len(l_df_X_names), n_col)
]

# To explain what 'range(0, len(l_df_X_names), n_col)' holds:
print(list(range(0, len(l_df_X_names), n_col)))

# First item in 'l_chart_row':
l_chart_row[0]

Now, we can plot the 38 charts on a canvas. Note, the '*' operator is used to unpack a list.

In [None]:
#alt.vconcat(*l_chart_row)
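
The slicing pattern and the `*` operator can be tried out without any charts. The sketch below splits 38 placeholder names into chunks of four and passes the chunks to a small helper function; both `l_names` and `f_count_args()` are hypothetical and only serve the illustration.

```python
# Placeholder names standing in for the 38 numerical variables.
l_names = [f"var_{i}" for i in range(38)]

n_col = 4

# Slice the list into chunks of n_col items; the last chunk may be shorter.
l_chunks = [l_names[i:i + n_col] for i in range(0, len(l_names), n_col)]

print(len(l_chunks))      # Number of chunks.
print(len(l_chunks[-1]))  # Size of the last chunk.


def f_count_args(*args):
    """Return the number of positional arguments received."""
    return len(args)


# With '*', each chunk is passed as a separate argument.
n_args_unpacked = f_count_args(*l_chunks)

# Without '*', the whole list is passed as one single argument.
n_args_packed = f_count_args(l_chunks)

print(n_args_unpacked, n_args_packed)
```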

We observe that `Overall Qual`, `Garage Cars`, and `Garage Area` seem to correlate well with `SalePrice_log`. The variable `Total Bsmt SF` also correlates well with `SalePrice_log`, although we see some outliers that may cause the trendline to become shallower. We also observe outliers in other 'SF' variables; something to keep in mind for Data Preparation. What does it mean for the prediction model that some variables correlate well with the outcome variable Y?

In case you want to include a trendline in each of the scatterplots, check out the function below, which is part of the utils_pieter package. The .repeat() method does not seem to work together with the .transform_regression() method, see scrapyard.py in the example-solutions\archive\ folder. So, we create a list of 38 single charts (incl. trendline) using f_plot_scatter_with_trend(). Then, the list is converted to a list of row charts, which is plotted using vconcat() and hconcat() using f_plot_scatter_with_trend_grid().

In [None]:
# up.f_plot_scatter_with_trend_grid(
#     df_input = df_scoped,
#     l_x      = l_df_X_names,
#     c_y      = 'SalePrice_log',
#     n_col    = 4
# )

#### h. Create a table showing the Pearson correlation coefficients between Y and each of the numerical variables. Tip: see [pearsonr()](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html).

The Pearson correlation coefficients between the outcome variable (Y) and each of the numerical variables quantify how well they correlate with each other.

The Pearson correlation coefficient - named after [Karl Pearson](https://en.wikipedia.org/wiki/Karl_Pearson) - quantifies the strength of the linear relationship between two numerical data samples. To understand the calculation of Pearson's correlation coefficient we construct the calculation using the formula below ([ref](https://machinelearningmastery.com/how-to-use-correlation-to-understand-the-relationship-between-variables/)) and we compare it to the outcome of the `pearsonr()` function. The Pearson coefficient is calculated as,

$$
Pearson\,correlation\,coefficient = \frac{covariance(X, Y)} {std(X) * std(Y)}
$$

For correlation between categorical and numerical variables you can use ANOVA or the point-biserial correlation ([ref](https://www.tutorialspoint.com/correlation-between-categorical-and-continuous-variables)). For correlation between categorical variables you can use Cramér's V (symmetrical) or Theil's U (asymmetrical) ([ref](https://towardsdatascience.com/the-search-for-categorical-correlation-a1cf7f1888c9)).

##### h1. How is the Pearson correlation coefficient calculated?


We will work through an example where we calculate the Pearson correlation coefficients between the variables `SalePrice_log` and `Overall Qual`, among others.

In [None]:
# Import module.
from scipy.stats import pearsonr

# Initialize.
c_variable = "Overall Qual"
#c_variable = "Gr Liv Area"
#c_variable = "Garage Cars"

# Extract variable data from the data frame.
ps_variable    = df_scoped[c_variable]

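# Calculate covariance matrix. Note that np.cov() normalizes by n-1 (ddof=1) by default.
m_cov          = np.cov(ps_variable, ps_y_scoped_log)

# Covariance matrix - Calculate Pearson correlation coefficient between two variables.
# We pass ddof=1 to np.std(), so the standard deviations use the same n-1 normalization.
n_corr_cov_mat = round(m_cov[0,1] / (np.std(ps_y_scoped_log, ddof=1) * np.std(ps_variable, ddof=1)), 3)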

# Pearsonr() - Calculate Pearson correlation coefficient between two variables.
n_corr_pearson = round(pearsonr(ps_y_scoped_log, ps_variable)[0], 3)

print(f"Pearson correlation coefficient between variable '{c_variable}' and 'SalePrice_log', calculated using:")

print(f"Covariance matrix-based formula: {n_corr_cov_mat}")

print(f"SciPy pearsonr() function:       {n_corr_pearson}")
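
One detail deserves attention when building the formula by hand: `np.cov()` normalizes by n-1 by default (`ddof=1`), whereas `np.std()` normalizes by n by default (`ddof=0`). The sketch below, using made-up numbers, shows that the formula only reproduces the Pearson coefficient when both use the same `ddof`; mixing the defaults inflates the result by a factor n/(n-1). For a large sample this factor is close to one, so the difference can easily go unnoticed.

```python
import numpy as np

# Hypothetical data, chosen so the result is easy to verify by hand.
a_x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
a_y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
n = len(a_x)

# Covariance between x and y; np.cov() uses ddof=1 by default.
n_cov = np.cov(a_x, a_y)[0, 1]

# Consistent: standard deviations with ddof=1 as well.
n_r_consistent = n_cov / (np.std(a_x, ddof=1) * np.std(a_y, ddof=1))

# Mixed: np.std() with its default ddof=0.
n_r_mixed = n_cov / (np.std(a_x) * np.std(a_y))

# Reference value from NumPy's correlation matrix.
n_r_numpy = np.corrcoef(a_x, a_y)[0, 1]

print(f"Consistent ddof: {n_r_consistent:.3f}")
print(f"Mixed ddof:      {n_r_mixed:.3f}")
print(f"np.corrcoef():   {n_r_numpy:.3f}")
```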


##### h2. Calculate Pearson correlation coefficients between Y and all numerical variables, incl. Y.

In [None]:
l_pearson_corr = [

    round(pearsonr(ps_y_scoped_log, df_scoped[x])[0], 3)

    for x in l_df_num_names
]

The object `l_pearson_corr` contains the Pearson correlation coefficients between `SalePrice_log` and each numerical variable. We will use it to construct a data frame sorted by the absolute value of the Pearson correlation coefficient in descending order.

In [None]:
df_corr_table = (

    pd.DataFrame({'name': l_df_num_names, 'corr': l_pearson_corr})
    .assign(corr_abs = lambda df: df['corr'].abs())
    .sort_values(by = 'corr_abs', ascending=False)
)

df_corr_table.head(10)
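
As an aside, pandas can also sort by absolute value without a helper column, using the `key` argument of `sort_values()`. A minimal sketch with made-up correlation values (`df_demo` is hypothetical):

```python
import pandas as pd

# Hypothetical correlation values, including a strong negative one.
df_demo = pd.DataFrame({
    'name': ['a', 'b', 'c'],
    'corr': [0.3, -0.8, 0.5],
})

# The key function is applied to the column before sorting.
df_sorted = df_demo.sort_values(
    by        = 'corr',
    key       = lambda ps: ps.abs(),
    ascending = False
)

print(df_sorted)
```

The helper column used above has the advantage that the absolute values remain visible in the resulting table.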

We kept `SalePrice` and `SalePrice_log` in the data, so we could observe what correlation values would be calculated. Why do we observe 1 for `SalePrice_log` and 0.95 for `SalePrice`?

We observe that - of the other variables - `Overall Qual` has the highest correlation with `SalePrice_log`. What does this mean in case of a model predicting `SalePrice_log`?

#### i. Create correlation plots showing the correlations between each pair of numerical variables, incl. Y. Tip: see Seaborn's [heatmap](https://seaborn.pydata.org/generated/seaborn.heatmap.html) and [Fritz' Blog](https://fritz.ai/seaborn-heatmaps-13-ways-to-customize-correlation-matrix-visualizations/).

Heatmaps are a useful way to visualize correlations. Seaborn has a straightforward solution for plotting heatmaps of correlations.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

##### i1. Plot correlation heatmap for all numerical variables.

To show the benefit of this approach, we plot a heatmap of all correlations among all numerical variables. We can identify which pairs have high and which pairs have low correlation.

In [None]:
plt.figure(figsize=(10, 10)) 

sns.heatmap(
    data   = df_scoped_num.corr(),
    annot  = False,
    square = True,
    cmap   = 'coolwarm'
);

##### i2. Plot correlation heatmap for Top-10 numerical variables having the highest correlation with Y.

In [None]:
plt.figure(figsize=(10, 10)) 

sns.heatmap(
    data      = df_scoped_num[df_corr_table.head(10)['name']].corr(),
    annot     = True,
    square    = True,
    annot_kws = {"size": 12},
    cmap      = 'coolwarm'
);

In addition to the confirmation of the correlation between `SalePrice_log` and the other numerical variables, we also observe that `Garage Area` and `Garage Cars` are highly correlated. What does this mean in terms of our model?

##### i3. Plot correlation heatmap for the numerical variables having 'SF' in the variable name

Let's look at another way of subsetting the numerical variables. Suppose we want to select only those variables that contain 'SF' ('square feet').

In [None]:
l_df_num_names_sf = [x for x in l_df_num_names if "SF" in x or x == 'SalePrice_log']

plt.figure(figsize=(10, 10)) 

sns.heatmap(
    data      = df_scoped_num[l_df_num_names_sf].corr(),
    annot     = True,
    square    = True,
    annot_kws = {"size": 12},
    cmap      = 'coolwarm'
);

The heatmap shows that `Total Bsmt SF` and `1st Flr SF` are highly correlated. What does this mean in terms of our model?
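
With many variables, reading highly correlated pairs off a heatmap becomes tedious. One possible approach is to list them programmatically: take the absolute correlation matrix, keep only the upper triangle (so each pair appears once and the diagonal is excluded), and filter on a threshold. The data frame `df_demo` and the threshold of 0.8 below are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'b' is almost a multiple of 'a', 'c' is unrelated.
df_demo = pd.DataFrame({
    'a': [1, 2, 3, 4, 5, 6],
    'b': [2, 4, 6, 8, 10, 13],
    'c': [1, -1, 1, -1, 1, -1],
})

# Absolute Pearson correlations between all pairs of columns.
df_corr_abs = df_demo.corr().abs()

# Boolean mask that is True strictly above the diagonal.
m_upper = np.triu(np.ones(df_corr_abs.shape, dtype=bool), k=1)

# Keep the upper triangle and reshape to one value per pair.
ps_pairs = df_corr_abs.where(m_upper).stack()

# Select the pairs above the chosen threshold (missing values never pass).
ps_high = ps_pairs[ps_pairs > 0.8]

print(ps_high)
```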

Note, I made a function `f_heatmap()`; see the `utils_pieter` package for the source code. Feel free to copy and paste the source code into your notebook and check out how it works. See also Altair's [simple heatmap plot](https://altair-viz.github.io/gallery/simple_heatmap.html) and [Annual Weather Heatmap example](https://altair-viz.github.io/gallery/annual_weather_heatmap.html).



In [None]:
from utils_pieter import f_heatmap

In [None]:
f_heatmap(    
    df_input      = df_scoped_num,
    l_df_names    = df_corr_table.head(10)['name'],
    b_add_corr    = True,
    n_font_size   = 14,
    n_canvas_size = 400
)