# **03 - Correlation Analysis and Visualization**

## Objectives

* Analyze relationships between house attributes and the target variable `SalePrice` to identify key predictors for modeling.
* Validate hypotheses by performing correlation analysis and visualizing the results.
* Generate insights through visualizations such as heatmaps, scatter plots, and box plots that show the strength and direction of relationships.

## Inputs

* Cleaned dataset: `outputs/datasets/cleaned/house_prices_cleaned.parquet`
* Hypotheses, which can be found in [Hypotheses and Validation Process in README.md](ADD LINK!)

## Outputs

* Generate reusable code that answers **Business Requirement 1** by analyzing correlations and creating visualizations.
    * The code will also be used in the Streamlit app.
* Create and save data plots in the `docs/plots` directory for use in the Streamlit app.
* Identify and document the most relevant variables for the regression model based on the correlation analysis.

## Additional Comments

* The visualizations in this notebook will also be used in the final dashboard to meet **Business Requirement 1**.

---

## Change working directory

* We assume the notebooks are stored in a subfolder, so when running a notebook in the editor you need to change the working directory.

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with `os.getcwd()`

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory.
* `os.path.dirname()` gets the parent directory
* `os.chdir()` defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

## Section 1: Load Data

Load the cleaned dataset from `outputs/datasets/cleaned/house_prices_cleaned.parquet` into a DataFrame and display the first five rows.

In [None]:
import pandas as pd

df = pd.read_parquet("outputs/datasets/cleaned/house_prices_cleaned.parquet")
df.head(5)

---

## Section 2: Data Exploration

We create a data profiling report for exploratory data analysis (EDA) of the DataFrame `df`.

In [None]:
from ydata_profiling import ProfileReport

profile_report = ProfileReport(df=df, minimal=True)
profile_report.to_notebook_iframe()

---

## Section 3: Correlation and PPS Analysis

Our dataset includes four categorical variables stored as `object` data types. To incorporate these variables into the correlation analysis, we use One Hot Encoding to convert them into numerical format.

In [None]:
from feature_engine.encoding import OneHotEncoder

encoder = OneHotEncoder(variables=df.columns[df.dtypes=='object'].to_list(), drop_last=False)
df_ohe = encoder.fit_transform(df)

print(df_ohe.shape)
df_ohe.head(5)
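
As a rough illustration of what one-hot encoding does to an `object` column, the sketch below uses `pandas.get_dummies` on a made-up table rather than `feature_engine`. The values are assumptions chosen for the example only; the point is that each category becomes its own 0/1 column, which is comparable to what the encoder above produces with `drop_last=False`.

```python
import pandas as pd

# Toy data: values are illustrative, not taken from the project dataset
toy = pd.DataFrame({
    "KitchenQual": ["Gd", "TA", "Ex", "TA"],
    "SalePrice": [200000, 150000, 320000, 140000],
})

# Each category becomes its own 0/1 column; numeric columns are left untouched
toy_ohe = pd.get_dummies(toy, columns=["KitchenQual"], dtype=int)

print(toy_ohe.shape)
print(toy_ohe.columns.tolist())
```

Keeping all categories (no dropped column) makes every level visible in the correlation ranking, at the cost of perfectly collinear dummy columns.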

We define a set of functions to calculate and visualize the relationships between variables in the dataset. These functions generate heatmaps for correlation (Pearson and Spearman) and predictive power (PPS), offering insights into linear, monotonic, and non-linear relationships. To keep the output focused, we suppress `FutureWarning` messages, which are not critical to the analysis but clutter the console. The code also checks that the `docs/plots` directory exists and creates it if necessary, so that all generated plots are saved in one place. For readability, the heatmaps hide cells whose absolute value is 0.2 or lower.

In [None]:
import os
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import ppscore as pps
import warnings
%matplotlib inline

# Ignore FutureWarnings
warnings.filterwarnings("ignore", category=FutureWarning)

# Theme
sns.set_theme(style="darkgrid")

# Check and create the folder to save plots
def ensure_directory_exists(directory):
    if not os.path.exists(directory):
        os.makedirs(directory)
        print(f"Directory '{directory}' created.")

# Define correlation heatmap
def correlation_heatmap(df, threshold=0.5, figsize=(12, 8), font_size=8, title="Correlation Heatmap", save_path=None):
    """
    Generates a heatmap using correlation, filtering out weak correlations.
    """
    if df.shape[1] > 1:  # Check for enough columns
        # Keep rows/columns with at least one |value| >= threshold (the diagonal of 1s is included in this check)
        filtered_data = df.loc[(abs(df) >= threshold).any(axis=1), (abs(df) >= threshold).any(axis=0)]

        # Create a mask to hide the upper triangle and weak values (|value| <= 0.2)
        mask = np.zeros_like(filtered_data, dtype=bool)
        mask[np.triu_indices_from(mask)] = True
        mask[(abs(filtered_data) <= 0.2)] = True

        formatted_data = filtered_data.applymap(lambda x: round(x, 2) if abs(x) > 0.2 else 0)

        # Draw heatmap
        plt.figure(figsize=figsize)
        sns.heatmap(
            formatted_data, 
            annot=True, 
            cmap=sns.color_palette("Spectral"),
            mask=mask, 
            annot_kws={"size": font_size}, 
            linewidths=0.5
        )
        plt.title(title)

        # Save heatmap
        if save_path:
            ensure_directory_exists(os.path.dirname(save_path))  # Ensure directory exists
            plt.savefig(save_path, bbox_inches='tight')
            print(f"Heatmap saved to {save_path}")
        
        plt.show()

# Define PPS heatmap
def pps_heatmap(df, threshold=0.2, figsize=(12, 8), font_size=8, title="PPS Heatmap", save_path=None):
    """
    Generates a heatmap for PPS matrices, filtering out weak predictive scores.
    """
    if df.shape[1] > 1:  # Check for enough columns
        # Keep rows/columns with at least one |value| >= threshold
        filtered_data = df.loc[(abs(df) >= threshold).any(axis=1), (abs(df) >= threshold).any(axis=0)]

        # Create a mask to hide values <= 0.2 and values that are exactly 0
        mask = np.zeros_like(filtered_data, dtype=bool)
        mask[abs(filtered_data) <= 0.2] = True 
        mask[filtered_data == 0] = True

        formatted_data = filtered_data.applymap(lambda x: round(x, 2) if abs(x) > 0.2 else 0)

        # Draw heatmap
        plt.figure(figsize=figsize)
        ax = sns.heatmap(
            formatted_data, 
            annot=True, 
            cmap=sns.color_palette("Spectral"),
            annot_kws={"size": font_size}, 
            linewidths=0.5,
            mask=mask
        )
        # Remove axis titles
        ax.set_xlabel('')
        ax.set_ylabel('')

        plt.title(title)

        # Save heatmap
        if save_path:
            ensure_directory_exists(os.path.dirname(save_path))  # Ensure directory exists
            plt.savefig(save_path, bbox_inches='tight')
            print(f"Heatmap saved to {save_path}")
        
        plt.show()
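
To make the masking step in `correlation_heatmap` easier to follow, here is a minimal sketch on a made-up 3x3 correlation matrix. The matrix values are assumptions for illustration only. The mask is built the same way as above (upper triangle plus weak coefficients), and `DataFrame.where` is used only to show which cells would remain visible.

```python
import numpy as np
import pandas as pd

# Made-up 3x3 correlation matrix, for illustration only
corr = pd.DataFrame(
    [[1.0, 0.8, 0.1],
     [0.8, 1.0, -0.6],
     [0.1, -0.6, 1.0]],
    index=["a", "b", "c"],
    columns=["a", "b", "c"],
)

mask = np.zeros(corr.shape, dtype=bool)
mask[np.triu_indices_from(mask)] = True       # hide the diagonal and upper triangle
mask[(corr.abs() <= 0.2).to_numpy()] = True   # hide weak coefficients

# Cells the heatmap would still draw (masked cells become NaN here)
visible = corr.where(~mask)
print(visible)
```

Only the two lower-triangle cells with an absolute value above 0.2 survive; the weak 0.1 coefficient is hidden even though it sits in the lower triangle.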

### Correlation Analysis

We start by calculating the Pearson and Spearman correlations.

In [None]:
numeric_df = df.select_dtypes(include=[np.number])
spearman_corr = numeric_df.corr(method="spearman")
pearson_corr = numeric_df.corr(method="pearson")
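
The difference between the two methods can be seen on a small synthetic example. The sketch below, using made-up data, builds a relationship that is perfectly monotonic but far from linear: Spearman, which works on ranks, reports a perfect association, while Pearson reports a weaker one.

```python
import numpy as np
import pandas as pd

# Synthetic data: y always increases with x, but not along a straight line
x = np.arange(1, 11)
toy = pd.DataFrame({"x": x, "y": np.exp(x)})

pearson_r = toy.corr(method="pearson").loc["x", "y"]
spearman_r = toy.corr(method="spearman").loc["x", "y"]

print(f"Pearson:  {pearson_r:.2f}")
print(f"Spearman: {spearman_r:.2f}")
```

This is why looking at both heatmaps is useful: a variable that ranks much higher under Spearman than under Pearson likely has a monotonic but non-linear relationship with the target.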

Generate heatmaps for the Spearman and Pearson methods, hiding coefficients whose absolute value is 0.2 or lower.

In [None]:
print("*** Spearman heatmap to evaluate monotonic relationships")
correlation_heatmap(
    df=spearman_corr,
    threshold=0.5,
    figsize=(20, 12),
    font_size=8,
    title="Spearman Correlation Heatmap",
    save_path="docs/plots/spearman_correlation_heatmap.png"
)
print("*** Pearson heatmap to evaluate linear relationships")
correlation_heatmap(
    df=pearson_corr,
    threshold=0.5,
    figsize=(20, 12),
    font_size=8,
    title="Pearson Correlation Heatmap",
    save_path="docs/plots/pearson_correlation_heatmap.png"
)

Calculate the Spearman correlation coefficients between the target variable `SalePrice` and all other columns in the `df_ohe` DataFrame, then sort the results by absolute value in descending order, excluding the self-correlation of `SalePrice`.

In [None]:
spearman_corr = df_ohe.corr(method='spearman')['SalePrice'].sort_values(key=abs, ascending=False)[1:]
print("Top 10 Spearman correlations with SalePrice:")
spearman_corr.head(10)

Calculate the Pearson correlation coefficients between the target variable `SalePrice` and all other columns in the `df_ohe` DataFrame, then sort the results by absolute value in descending order, excluding the self-correlation of `SalePrice`.

In [None]:
pearson_corr = df_ohe.corr(method='pearson')['SalePrice'].sort_values(key=abs, ascending=False)[1:]
print("\nTop 10 Pearson correlations with SalePrice:")
print(pearson_corr.head(10))
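
The ranking step can be sketched on a made-up Series of coefficients. The feature names and values below are hypothetical. This variant removes the target by label with `drop` instead of slicing off the first element, which gives the same result whenever `SalePrice` sorts first and does not depend on that ordering.

```python
import pandas as pd

# Hypothetical correlation coefficients with the target
toy_corr = pd.Series(
    {"SalePrice": 1.0, "feat_a": 0.30, "feat_b": -0.75, "feat_c": 0.55}
)

# Drop the self-correlation by label, then rank by absolute value
ranked = toy_corr.drop("SalePrice").sort_values(key=abs, ascending=False)
print(ranked)
```

Sorting by absolute value keeps strong negative relationships, such as `feat_b` here, at the top of the list instead of pushing them to the bottom.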

Combine the top 10 attributes from Pearson and Spearman and create a list for further analysis.

In [None]:
top_n = 10
vars_to_study = list(set(pearson_corr.head(top_n).index.to_list() + spearman_corr.head(top_n).index.to_list()))

print("\nVariables selected for further study:")
print(vars_to_study)

### PPS Matrix

Calculate PPS Matrix.

In [None]:
pps_matrix_raw = pps.matrix(df)
pps_matrix = pps_matrix_raw.pivot(index='y', columns='x', values='ppscore')
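
The `pivot` call reshapes a long table with one row per (`x`, `y`) pair into a square matrix. A minimal sketch with hypothetical scores, using only the three columns relied on above, shows the resulting layout:

```python
import pandas as pd

# Hypothetical long-format scores in the x / y / ppscore layout
toy_long = pd.DataFrame({
    "x": ["a", "a", "b", "b"],
    "y": ["a", "b", "a", "b"],
    "ppscore": [1.0, 0.4, 0.1, 1.0],
})

# Rows are the predicted variable (y), columns are the predictor (x)
toy_wide = toy_long.pivot(index="y", columns="x", values="ppscore")
print(toy_wide)
```

Unlike a correlation matrix, this matrix is not symmetric: in the toy data, `a` predicts `b` with a score of 0.4, while `b` predicts `a` with only 0.1.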

Show PPS statistics

In [None]:
pps_stats = pps_matrix_raw.query("ppscore < 1")['ppscore'].describe()
print("PPS Statistics:\n", pps_stats.round(3))

Generate a heatmap for the PPS matrix, hiding scores of 0.2 or lower, and save the plot to `docs/plots`.

In [None]:
print("*** PPS Matrix Heatmap to detect both linear and non-linear relationships.")
pps_heatmap(
    df=pps_matrix,
    threshold=0.2,
    figsize=(20, 12),
    font_size=8,
    title="PPS Heatmap",
    save_path="docs/plots/pps_heatmap.png"
)

---

## Section 4: Exploratory Data Analysis (EDA) on Selected Variables

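Combine the top 10 variables from the Pearson and Spearman rankings into a single set. Because the two rankings can differ, the combined set may contain more than 10 variables.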

In [None]:
top_n = 10
vars_to_study = set(
    pearson_corr.head(top_n).index.to_list() +
    spearman_corr.head(top_n).index.to_list()
)

print("\nCombined top variables for further analysis:")
vars_to_study
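
A small sketch with hypothetical feature names shows how the union behaves. Sorting the result is an addition not present in the cell above; it makes the order reproducible, since a plain `set` has no guaranteed order.

```python
# Hypothetical top-3 lists from two rankings
top_pearson = ["feat_a", "feat_b", "feat_c"]
top_spearman = ["feat_b", "feat_c", "feat_d"]

# The union removes duplicates; sorting gives a stable order between runs
combined = sorted(set(top_pearson) | set(top_spearman))
print(combined)
```

Two top-3 lists yield four variables here, which is why the combined selection can be larger than `top_n`.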

We create a new DataFrame (`df_eda`) containing the top attributes along with the target variable `SalePrice`. This makes it easier to use these variables in further analysis or modeling.

In [None]:
selected_features = list(vars_to_study) + ['SalePrice']
df_eda = df_ohe[selected_features]

df_eda.head()

We need to understand the distribution of the target variable (`SalePrice`) and identify whether it is skewed or contains outliers. This is important because skewness or outliers can affect the performance of predictive models.

We create a histogram with a KDE (Kernel Density Estimate) overlay to visualize the distribution of `SalePrice`.

In [None]:
def plot_target_hist(df, target_var):
    """
    Function to plot a histogram of the target variable with KDE overlay.
    """
    plt.figure(figsize=(12, 5))
    sns.histplot(
        data=df,
        x=target_var,
        kde=True,
        color=sns.color_palette("Spectral")[0]
    )
    plt.title(f"Distribution of {target_var}", fontsize=20)
    plt.xlabel(target_var, fontsize=14)
    plt.ylabel("Frequency", fontsize=14)
    plt.savefig(f'docs/plots/hist_plot_{target_var}.png', bbox_inches='tight')        
    plt.show()

# Analyze the distribution of SalePrice
plot_target_hist(df_eda, 'SalePrice')
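
Skewness can also be checked numerically alongside the histogram. The sketch below uses synthetic, log-normally distributed values as a stand-in for prices (an assumption for illustration, not the project data) and compares `Series.skew()` before and after a `log1p` transformation.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values standing in for sale prices
rng = np.random.default_rng(seed=0)
toy_prices = pd.Series(rng.lognormal(mean=12, sigma=0.5, size=2000))

raw_skew = toy_prices.skew()
log_skew = np.log1p(toy_prices).skew()

print(f"Skewness before log transform: {raw_skew:.2f}")
print(f"Skewness after log transform:  {log_skew:.2f}")
```

A clearly positive value indicates a long right tail; a value near zero after the transformation suggests the log scale is closer to symmetric.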

To address **Business Requirement 1**, which is to discover how house attributes correlate with the sale price, we perform a bivariate analysis that examines the relationship between each variable in `vars_to_study` and the target variable `SalePrice`.

We create three types of visualizations, which help us answer **Business Requirement 1**:
* **Linear regression plots** for continuous variables.
* **Boxplots** for categorical variables.
* **Line plots** for time variables.

The function `create_visualizations` automates the process of visualization by iterating through all variables in `vars_to_study` and selecting the appropriate visualization based on the variable type.

In [None]:
sns.set_style('darkgrid')

# Time variables
time = ['YearBuilt', 'YearRemodAdd']

def plot_lm(df, col, target_var):
    """
    Function to create linear regression plots for continuous variables using the Spectral palette.
    """
    plt.figure(figsize=(10, 6))
    scatter = plt.scatter(
        x=df[col], 
        y=df[target_var], 
        c=df[col], 
        cmap='Spectral', 
        alpha=0.7, 
        edgecolor='k'
    )
    sns.regplot(
        data=df, 
        x=col, 
        y=target_var, 
        scatter=False,
        line_kws={'color': 'black'} 
    )

    cbar = plt.colorbar(scatter)
    cbar.set_label(f"{col}", fontsize=12)

    plt.title(f"Linear Regression Plot: {col} vs {target_var}", fontsize=14)
    plt.xlabel(col, fontsize=8)
    plt.ylabel(target_var, fontsize=8)
    plt.grid(True, linestyle='--', alpha=0.6)
    plt.savefig(f'docs/plots/lm_plot_price_by_{col}.png', bbox_inches='tight')
    plt.show()

def plot_box(df, col, target_var):
    """
    Function to create box plots for categorical variables.
    Dynamically adjusts the palette based on the number of categories.
    """
    num_categories = len(df[col].unique()) 
    palette = sns.color_palette("Spectral", n_colors=num_categories)

    plt.figure(figsize=(10, 4))
    sns.boxplot(
        data=df, 
        x=col, 
        y=target_var, 
        palette=palette
    ) 
    plt.title(f"{col}", fontsize=12)
    plt.xlabel(col, fontsize=8)
    plt.ylabel(target_var, fontsize=8)
    plt.savefig(f'docs/plots/box_plot_price_by_{col}.png', bbox_inches='tight')
    plt.show()

def plot_line(df, col, target_var):
    """
    Function to create line plots for time variables.
    """
    plt.figure(figsize=(10, 4))
    sns.lineplot(
        data=df, 
        x=col, 
        y=target_var, 
        color=sns.color_palette("Spectral")[1]
    )
    plt.title(f"{col}", fontsize=12)
    plt.xlabel(col, fontsize=8)
    plt.ylabel(target_var, fontsize=8)
    plt.savefig(f'docs/plots/line_plot_price_by_{col}.png', bbox_inches='tight')        
    plt.show()

# Loop for visualizations
def create_visualizations(df, vars_to_study, target_var):
    """
    Loop through variables and create appropriate visualizations.
    """
    for col in vars_to_study:
        if len(df[col].unique()) <= 10:  # Low-cardinality (categorical or ordinal) variables
            plot_box(df, col, target_var)
            print(f"*** Boxplot created for {col}\n\n")
        elif col in time:  # Time variables
            plot_line(df, col, target_var)
            print(f"*** Line plot created for {col}\n\n")
        else:  # Continuous variables
            plot_lm(df, col, target_var)
            print(f"*** Linear regression plot created for {col}\n\n")

# Call the function to create visualizations
create_visualizations(df_eda, vars_to_study, 'SalePrice')
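
The selection rule inside `create_visualizations` can be isolated and checked without drawing anything. In the sketch below, `choose_plot` is a hypothetical helper that is not part of the notebook, and the data is made up; it applies the same order of checks (cardinality first, then time variables, then everything else).

```python
import pandas as pd

TIME_VARS = ["YearBuilt", "YearRemodAdd"]  # same names as the `time` list above

def choose_plot(series, max_categories=10, time_vars=TIME_VARS):
    """Return the plot type the dispatch rule would pick for a column."""
    if series.nunique(dropna=False) <= max_categories:
        return "box"
    if series.name in time_vars:
        return "line"
    return "lm"

# Made-up columns: one low-cardinality, one time variable, one continuous
toy = pd.DataFrame({
    "OverallQual": [5, 6, 7, 5, 6, 7, 8, 5, 6, 7, 8, 9],
    "YearBuilt": list(range(1990, 2002)),
    "GrLivArea": [900 + 50 * i for i in range(12)],
})

for col in toy.columns:
    print(col, "->", choose_plot(toy[col]))
```

Because the cardinality check comes first, a time variable with 10 or fewer distinct values would get a boxplot rather than a line plot.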

## Conclusions and Next Steps

### Conclusions
We successfully completed the correlation analysis and visualization process, identifying key predictors for modeling and validating hypotheses related to house prices.

#### Key observations include
1. **Size Matters**: Larger properties, as indicated by variables like `1stFlrSF`, `GrLivArea`, `TotalBsmtSF`, and `GarageArea`, are associated with higher sale prices.
2. **Time Matters**: Recently built houses (`YearBuilt`) and houses with recent remodels (`YearRemodAdd`) tend to have higher sale prices.
3. **Quality Matters**: Higher overall quality (`OverallQual`) and kitchen quality (`KitchenQual`) ratings are strongly correlated with higher sale prices.

All visualizations were saved in the `docs/plots` directory for further use in the Streamlit app and to meet **Business Requirement 1**.

### Next steps: Feature Engineering
1. **Handle outliers**: Identify and address outliers in key variables like `GrLivArea`, `TotalBsmtSF`, and `GarageArea` to improve model robustness.
2. **Transform Variables**: Apply log transformation to `SalePrice` and other skewed variables to reduce skewness and improve linearity.
3. **Create New Features**: Combine existing variables to create new features.
4. **Scale Data**: Standardize or normalize numerical variables to ensure consistent scaling for modeling.
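
As a rough preview of steps 1, 2, and 4, the sketch below works on a made-up column. The 1.5 × IQR rule, `log1p`, and z-score standardization are common choices assumed here for illustration; the feature engineering notebook may use different techniques.

```python
import numpy as np
import pandas as pd

# Made-up living-area values with one extreme observation
toy = pd.DataFrame({"GrLivArea": [1000, 1100, 1200, 1300, 1400, 1500, 5000]})

# Step 1: flag outliers with the 1.5 * IQR rule
q1, q3 = toy["GrLivArea"].quantile([0.25, 0.75])
iqr = q3 - q1
toy["is_outlier"] = (
    (toy["GrLivArea"] < q1 - 1.5 * iqr) | (toy["GrLivArea"] > q3 + 1.5 * iqr)
)

# Step 2: log-transform to reduce right skew
toy["GrLivArea_log"] = np.log1p(toy["GrLivArea"])

# Step 4: standardize to zero mean and unit (sample) standard deviation
log_col = toy["GrLivArea_log"]
toy["GrLivArea_std"] = (log_col - log_col.mean()) / log_col.std()

print(toy)
```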