# **(ADD THE NOTEBOOK NAME HERE)**

## Objectives

* Write your notebook objective here, for example, "Fetch data from Kaggle and save as raw data", or "engineer features for modelling"

## Inputs

* Write here which data or information you need to run the notebook 

## Outputs

* Write here which files, code or artefacts you generate by the end of the notebook 

## Additional Comments

* In case you have any additional comments that don't fit in the previous bullets, please state them here. 


---

# Change working directory

* We are assuming you will store the notebooks in a subfolder, therefore when running the notebook in the editor, you will need to change the working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
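
As a small illustration of the two calls used above, the sketch below computes the parent folder of a path without actually changing the working directory. The helper name `parent_directory` and the example path are assumptions made for illustration only; they are not part of the notebook.

```python
import os


def parent_directory(path):
    """Return the parent folder of `path` (hypothetical helper, not part of the notebook)."""
    # normpath drops a trailing separator, so "a/b/" and "a/b" give the same parent
    return os.path.dirname(os.path.normpath(path))


# Made-up example path standing in for the notebook subfolder
example_path = os.path.join("project", "jupyter_notebooks")
print(parent_directory(example_path))
```

Because the helper only inspects the path string, it can be run repeatedly without moving the working directory further up, unlike calling `os.chdir()` more than once.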

# Backup

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Assuming df is your original dataframe
df_missing = df.copy()
df_missing['BsmtExposure'] = df_missing['BsmtExposure'].fillna('Missing')

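# One-hot encode BsmtExposure (dtype=int keeps the dummy columns numeric)
df_encoded = pd.get_dummies(df_missing, columns=['BsmtExposure'], drop_first=True, dtype=int)

# Calculate the correlation matrix over the numeric columns only
correlation_matrix = df_encoded.corr(numeric_only=True)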

# Extract the correlations with SalePrice
correlation_with_saleprice = correlation_matrix['SalePrice'].drop('SalePrice')

# Plot the correlations with SalePrice
plt.figure(figsize=(10, 6))
sns.barplot(x=correlation_with_saleprice.values, y=correlation_with_saleprice.index, palette="viridis")
plt.title('Correlation of BsmtExposure Encoded Variables with SalePrice')
plt.xlabel('Correlation coefficient')
plt.ylabel('Encoded Variables')
plt.show()

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import spearmanr

def correlation_to_sale_price_spearman(df, vars_to_study):
    """ Regression plots of variables vs SalePrice with Spearman correlation annotation """
    target_var = 'SalePrice'
    
    for col in vars_to_study:
        # Calculate Spearman correlation
        spearman_corr, p_value = spearmanr(df[col], df[target_var])
        
        # Create scatter plot with regression line
        g = sns.lmplot(data=df, x=col, y=target_var, line_kws={'color': 'red'})
        
        # Set the title and labels
        g.set_axis_labels(col, target_var, fontsize=15)
        g.fig.suptitle(f"{col} (Spearman: {spearman_corr:.2f}, p-value: {p_value:.2e})", fontsize=20, y=1.05)
        
        plt.show()
        print("\n\n")

In [None]:
def correlation_to_sale_price_joint(df, vars_to_study):
    """  Joint plots of variables vs SalePrice """
    target_var = 'SalePrice'
    for col in vars_to_study:
        x, y, hue = col, target_var, 'OverallQual'
        sns.jointplot(data=df, x=x, y=y, kind='hex')
        # sns.jointplot(data=df, x=x, y=y, hue=hue)
        plt.title(f"{col}", fontsize=20, y=1.3, x=-3)
        plt.show()
        print("\n\n")


correlation_to_sale_price_joint(df_eda, vars_to_study)

In [None]:
def correlation_to_sale_price_scat(df, vars_to_study):
    """  scatterplots of variables vs SalePrice """
    target_var = 'SalePrice'
    for col in vars_to_study:
        fig, axes = plt.subplots(figsize=(8, 5))
        axes = sns.scatterplot(data=df, x=col, y=target_var, hue='OverallQual')
        plt.title(f"{col}", fontsize=20, y=1.05)
        plt.show()
        print("\n\n")

correlation_to_sale_price_scat(df_eda, vars_to_study)

In [None]:
def correlation_to_sale_price_lm(df, vars_to_study):
    """  Regression (lm) plots of variables vs SalePrice """
    target_var = 'SalePrice'
    for col in vars_to_study:
        # fig, axes = plt.subplots(figsize=(8, 5))
        sns.lmplot(data=df, x=col, y=target_var)
        plt.title(f"{col}", fontsize=20, y=1.05)
        plt.show()
        print("\n\n")

correlation_to_sale_price_lm(df_eda, vars_to_study)

In [None]:
def correlation_to_sale_price_hist(df, vars_to_study):
    """ Display correlation plot between variables and sale price """
    target_var = 'SalePrice'
    for col in vars_to_study:
        fig, axes = plt.subplots(figsize=(8, 5))
        axes = sns.histplot(data=df, x=col, y=target_var)
        plt.title(f"{col}", fontsize=20, y=1.05)
        plt.show()
        print("\n\n")

correlation_to_sale_price_hist(df_eda, vars_to_study)

In [None]:
non_integer_values_dict = {}

for column in df.columns:
    # Check if all values in the column are integers
    if not df[column].apply(lambda x: isinstance(x, int)).all():
        # Collect non-integer values, filtering out floats that don't start with '0.'
        non_integer_values = df[column][~df[column].apply(lambda x: isinstance(x, int))]
        non_integer_values = non_integer_values[~non_integer_values.apply(lambda x: isinstance(x, float) and not str(x).startswith('0.'))]
        # Use a set to ensure uniqueness
        unique_non_integer_values = set(non_integer_values)
        non_integer_values_dict[column] = list(unique_non_integer_values)

# Print the results
for column, values in non_integer_values_dict.items():
    print(f"Non-integer values in {column}: {values}")

In [None]:
import matplotlib.pyplot as plt

# Fetch the top scores
pps_topscores = pps_matrix.iloc[19].sort_values(key=abs, ascending=False)[1:11]

# Print the values
print(pps_topscores)

# Plot the bar chart
plt.bar(x=pps_topscores.index, height=pps_topscores)
plt.xticks(rotation=90)
plt.title("Predictive Power Score", fontsize=20, y=1.05)

# Annotate the bars with the values
for index, value in enumerate(pps_topscores):
    plt.text(index, value, f'{value:.2f}', ha='center', va='bottom')

plt.show()

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Assuming df is your original dataframe
df_missing = df.copy()
df_missing['BsmtExposure'] = df_missing['BsmtExposure'].fillna('Missing')
df_missing['BsmtFinType1'] = df_missing['BsmtFinType1'].fillna('Missing')

# Calculate the mean SalePrice for each BsmtExposure category
mean_saleprice = df_missing.groupby('BsmtExposure')['SalePrice'].mean().reset_index()

# Index the mean prices by category so the heatmap shows one row per BsmtExposure value
pivot_table = mean_saleprice.set_index('BsmtExposure')

# Create the heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(pivot_table, annot=True, fmt=".1f", cmap="YlGnBu")
plt.title('Average Sale Price by BsmtExposure')
plt.show()

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import pearsonr, spearmanr, linregress
from statsmodels.nonparametric.smoothers_lowess import lowess

# Sample mapping for KitchenQual
kitchen_qual_mapping = {'Ex': 4, 'Gd': 3, 'TA': 2, 'Fa': 1}

# Transform the KitchenQual values to numerical values in the DataFrame
df['KitchenQual_num'] = df['KitchenQual'].map(kitchen_qual_mapping)

# Display the count of each category in KitchenQual
print(df['KitchenQual'].value_counts())

# Function to plot a variable against SalePrice with Pearson and Spearman trendlines
def plot_with_trendlines(df, vars, target='SalePrice'):
    num_vars = len(vars)
    plt.figure(figsize=(16, 6 * num_vars))
    
    for i, var in enumerate(vars, 1):
        x = df[var]
        y = df[target]
        
        # Pearson correlation
        pearson_coef, _ = pearsonr(x, y)
        slope_pearson, intercept_pearson, _, _, _ = linregress(x, y)
        line_pearson = slope_pearson * x + intercept_pearson
        
        # Spearman correlation
        spearman_coef, _ = spearmanr(x, y)
        lowess_smoothed = lowess(y, x, frac=0.3)
        
        # Plotting
        plt.subplot(num_vars, 1, i)
        sns.scatterplot(x=x, y=y, label='Data points')
        
        plt.plot(x, line_pearson, color='red', label=f'Pearson trendline (r={pearson_coef:.2f})')
        plt.plot(lowess_smoothed[:, 0], lowess_smoothed[:, 1], color='blue', label=f'Spearman trendline (r={spearman_coef:.2f})')
        
        plt.xlabel(var)
        plt.ylabel(target)
        plt.title(f'{var} vs {target} with Pearson and Spearman Trendlines')
        plt.legend()
    
    plt.tight_layout()
    plt.show()

# Example usage for multiple variables, including transformed KitchenQual
variables = ['YearBuilt', 'OverallQual', 'KitchenQual_num']
plot_with_trendlines(df, variables)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import pearsonr, spearmanr, linregress
from statsmodels.nonparametric.smoothers_lowess import lowess

# Assuming df is your DataFrame and has been defined elsewhere

# Function to plot a variable against SalePrice with Pearson and Spearman trendlines
def plot_with_trendlines(df, var, target='SalePrice'):
    x = df[var]
    y = df[target]
    
    # Pearson correlation
    pearson_coef, _ = pearsonr(x, y)
    slope_pearson, intercept_pearson, _, _, _ = linregress(x, y)
    line_pearson = slope_pearson * x + intercept_pearson
    
    # Spearman correlation
    spearman_coef, _ = spearmanr(x, y)
    lowess_smoothed = lowess(y, x, frac=0.3)
    
    # Plotting
    plt.figure(figsize=(10, 6))
    sns.scatterplot(x=x, y=y, label='Data points')
    
    plt.plot(x, line_pearson, color='red', label=f'Pearson trendline (r={pearson_coef:.2f})')
    plt.plot(lowess_smoothed[:, 0], lowess_smoothed[:, 1], color='blue', label=f'Spearman trendline (r={spearman_coef:.2f})')
    
    plt.xlabel(var)
    plt.ylabel(target)
    plt.title(f'{var} vs {target} with Pearson and Spearman Trendlines')
    plt.legend()
    plt.show()

# Example for 'YearBuilt'
plot_with_trendlines(df, 'YearBuilt')

In [None]:
plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
sns.boxplot(x='KitchenQual', y='SalePrice', data=df)
plt.xlabel('KitchenQual')
plt.ylabel('SalePrice')
plt.title('SalePrice Distribution by KitchenQual')

plt.subplot(1, 2, 2)
sns.boxplot(x='OverallQual', y='SalePrice', data=df)
plt.xlabel('OverallQual')
plt.ylabel('SalePrice')
plt.title('SalePrice Distribution by OverallQual')

plt.tight_layout()
plt.show()

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import spearmanr

# Ensure plots are displayed in Jupyter notebook
%matplotlib inline

def correlation_to_sale_price_spearman(df, vars_to_study):
    """ Regression plots of variables vs SalePrice with Spearman correlation annotation """
    target_var = 'SalePrice'
    
    for col in vars_to_study:
        # Calculate Spearman correlation
        spearman_corr, p_value = spearmanr(df[col], df[target_var])
        
        # Create scatter plot with regression line
        g = sns.lmplot(data=df, x=col, y=target_var, line_kws={'color': 'red'})
        
        # Set the title and labels
        g.set_axis_labels(col, target_var, fontsize=15)
        g.fig.suptitle(f"{col} (Spearman: {spearman_corr:.2f}, p-value: {p_value:.2e})", fontsize=20, y=1.05)
        
        plt.show()

def plot_categorical_vs_sale_price(df, categorical_vars):
    """ Box plots of categorical variables vs SalePrice with mean curve overlay """
    target_var = 'SalePrice'
    
    for col in categorical_vars:
        plt.figure(figsize=(10, 6))
        
        # Create box plot
        sns.boxplot(x=df[col], y=df[target_var])
        
        # Calculate mean SalePrice for each category
        means = df.groupby(col)[target_var].mean()
        
        # Overlay mean SalePrice curve
        plt.plot(means.index, means.values, color='red', marker='o', linestyle='--', linewidth=2, markersize=8)
        
        # Add titles and labels
        plt.title(f"{col} vs {target_var}", fontsize=20, y=1.05)
        plt.xlabel(col, fontsize=15)
        plt.ylabel(target_var, fontsize=15)
        
        plt.xticks(rotation=45)
        
        plt.show()

In [None]:
print(df_eda.head())
print(df_eda[vars_to_study].describe())
print(df_eda[categorical_vars].describe())

In [None]:
correlation_matrix = df[['KitchenQual_Encoded', 'OverallQual']].corr()
print("Correlation between KitchenQual and OverallQual:")
print(correlation_matrix)

In [None]:
corr_pearson = df.corr(method='pearson')['SalePrice'].sort_values(key=abs, ascending=False)[1:]
corr_pearson

In [None]:
corr_spearman = df.corr(method='spearman')['SalePrice'].sort_values(key=abs, ascending=False)[1:]
corr_spearman

In [None]:
df_corr_pearson, df_corr_spearman, pps_matrix = CalculateCorrAndPPS(df)

In [None]:
import pandas as pd
import ppscore as pps

# Load the DataFrame with your data
df = pd.read_csv("/workspace/heritage-housing2/jupyter_notebooks/outputs/datasets/cleaned/HousePricesCleaned.csv")

# Calculate the PPS matrix
pps_matrix = pps.matrix(df)

# Filter the PPS matrix to show only the rows where 'y' is 'SalePrice'
pps_against_saleprice = pps_matrix[pps_matrix['y'] == 'SalePrice']

# Sort by the PPS score in descending order
pps_against_saleprice_sorted = pps_against_saleprice.sort_values(by='ppscore', ascending=False)

# Display the sorted PPS matrix
print(pps_against_saleprice_sorted)

In [None]:
import pandas as pd
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Assuming df is your DataFrame
# df = pd.read_csv('your_data.csv')

# Identify numeric variables
numeric_vars = df.select_dtypes(include=[np.number]).columns.tolist()

# Identify categorical variables and convert to numerical
df_encoded = pd.get_dummies(df, columns=categorical_vars, drop_first=True)

# Update the list of numeric variables to include the new one-hot encoded columns
numeric_vars = df_encoded.select_dtypes(include=[np.number]).columns.tolist()

# Ensure the list of variables is unique
numeric_vars = list(set(numeric_vars))

# Dictionary to store correlation results
correlations = {'Variable': [], 'Pearson': [], 'Spearman': []}

# Calculate correlations
for var in numeric_vars:
    if var != 'SalePrice':  # Exclude the target variable itself
        x = df_encoded[var]
        y = df_encoded['SalePrice']
        
        pearson_coef, _ = pearsonr(x, y)
        spearman_coef, _ = spearmanr(x, y)
        
        correlations['Variable'].append(var)
        correlations['Pearson'].append(pearson_coef)
        correlations['Spearman'].append(spearman_coef)

# Create a DataFrame with the correlation results
correlation_df = pd.DataFrame(correlations)

# Calculate the absolute values of the correlations
correlation_df['Abs_Pearson'] = correlation_df['Pearson'].abs()
correlation_df['Abs_Spearman'] = correlation_df['Spearman'].abs()

# Rank the variables based on absolute correlations
correlation_df['Pearson_Rank'] = correlation_df['Abs_Pearson'].rank(ascending=False)
correlation_df['Spearman_Rank'] = correlation_df['Abs_Spearman'].rank(ascending=False)

# Combine the ranks (average of Pearson and Spearman ranks)
correlation_df['Combined_Rank'] = (correlation_df['Pearson_Rank'] + correlation_df['Spearman_Rank']) / 2

# Sort the DataFrame based on the combined rank
correlation_df.sort_values(by='Combined_Rank', inplace=True)

# Display the correlation results
print(correlation_df)

# Extract the most related variable
most_related_variable = correlation_df.iloc[0]['Variable']
print(f"The most related variable to SalePrice is: {most_related_variable}")

In [None]:
def heatmap_corr(df, threshold, figsize=(20, 12), font_annot=8):
    if len(df.columns) > 1:
        mask = np.zeros_like(df, dtype=bool)
        mask[np.triu_indices_from(mask)] = True
        mask[abs(df) < threshold] = True

        fig, axes = plt.subplots(figsize=figsize)
        sns.heatmap(df, annot=True, xticklabels=True, yticklabels=True,
                    mask=mask, cmap='viridis', annot_kws={"size": font_annot}, ax=axes,
                    linewidth=0.5
                    )
        axes.set_yticklabels(df.columns, rotation=0)
        plt.ylim(len(df.columns), 0)
        plt.show()


def heatmap_pps(df, threshold, figsize=(20, 12), font_annot=8):
    if len(df.columns) > 1:
        mask = np.zeros_like(df, dtype=bool)
        mask[abs(df) < threshold] = True
        fig, ax = plt.subplots(figsize=figsize)
        ax = sns.heatmap(df, annot=True, xticklabels=True, yticklabels=True,
                         mask=mask, cmap='rocket_r', annot_kws={"size": font_annot},
                         linewidth=0.05, linecolor='grey')
        plt.ylim(len(df.columns), 0)
        plt.show()


def CalculateCorrAndPPS(df):
    df_corr_spearman = df.corr(method="spearman")
    df_corr_pearson = df.corr(method="pearson")

    pps_matrix_raw = pps.matrix(df)
    pps_matrix = pps_matrix_raw.filter(['x', 'y', 'ppscore']).pivot(columns='x', index='y', values='ppscore')

    pps_score_stats = pps_matrix_raw.query("ppscore < 1").filter(['ppscore']).describe().T
    print("PPS threshold - check PPS score IQR to decide threshold for heatmap \n")
    print(pps_score_stats.round(3))

    return df_corr_pearson, df_corr_spearman, pps_matrix


def DisplayCorrAndPPS(df_corr_pearson, df_corr_spearman, pps_matrix, CorrThreshold, PPS_Threshold,
                      figsize=(20, 12), font_annot=8):

    print("\n")
    print("* Analyse how the target variable for your ML models is correlated with other variables (features and target)")
    print("* Analyse multicollinearity, that is, how the features are correlated among themselves")

    print("\n")
    print("*** Heatmap: Spearman Correlation ***")
    print("It evaluates monotonic relationship \n")
    heatmap_corr(df=df_corr_spearman, threshold=CorrThreshold, figsize=figsize, font_annot=font_annot)

    print("\n")
    print("*** Heatmap: Pearson Correlation ***")
    print("It evaluates the linear relationship between two continuous variables \n")
    heatmap_corr(df=df_corr_pearson, threshold=CorrThreshold, figsize=figsize, font_annot=font_annot)

    print("\n")
    print("*** Heatmap: Predictive Power Score (PPS) ***")
    print(f"PPS detects linear or non-linear relationships between two columns.\n"
          f"The score ranges from 0 (no predictive power) to 1 (perfect predictive power) \n")
    heatmap_pps(df=pps_matrix, threshold=PPS_Threshold, figsize=figsize, font_annot=font_annot)


In [None]:
DisplayCorrAndPPS(df_corr_pearson = df_corr_pearson,
                  df_corr_spearman = df_corr_spearman,
                  pps_matrix = pps_matrix,
                  CorrThreshold = 0.2, PPS_Threshold = 0.1,
                  figsize=(12,10), font_annot = 10)

In [None]:
corr_spearman = df.corr(method='spearman')['SalePrice'].sort_values(key=abs, ascending=False)[1:]
corr_spearman

In [None]:
plt.bar(x=corr_spearman[:10].index, height=corr_spearman[:10])
plt.title("Spearman Correlation", fontsize=20, y=1.05)
plt.show()

In [None]:
corr_pearson = df.corr(method='pearson')['SalePrice'].sort_values(key=abs, ascending=False)[1:]
corr_pearson

In [None]:
plt.bar(x=corr_pearson[:6].index, height=corr_pearson[:6])
plt.title("Pearson Correlation", fontsize=20, y=1.05)
plt.show()

In [None]:
import pandas as pd

# The specified variables
variables = ['OverallQual', 'GrLivArea', 'GarageArea', 'TotalBsmtSF', '1stFlrSF', 'YearBuilt']
target = 'SalePrice'

# Assuming df_corr_pearson and df_corr_spearman are your correlation matrices
# Extract the correlation values for the specified variables
pearson_values = df_corr_pearson.loc[variables, target]
spearman_values = df_corr_spearman.loc[variables, target]

# Create a DataFrame to compare the values
comparison_df = pd.DataFrame({
    'Variable': variables,
    'Pearson Correlation': pearson_values.values,
    'Spearman Correlation': spearman_values.values
})

# Sort the DataFrame by 'Pearson Correlation' in descending order
sorted_comparison_df = comparison_df.sort_values(by='Pearson Correlation', ascending=False)

# Display the sorted DataFrame
print(sorted_comparison_df)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Define your data frame containing these variables and sales price
# Assuming `df` is your DataFrame containing the data

# Continuous variables
continuous_vars = ['YearBuilt', 'BsmtFinSF1', '1stFlrSF', 'OverallQual']

# Categorical variables
categorical_vars = ['KitchenQual', 'BsmtExposure']

# Plot continuous variables against sales price
plt.figure(figsize=(16, 8))
for i, var in enumerate(continuous_vars, 1):
    plt.subplot(2, 3, i)
    sns.scatterplot(x=var, y='SalePrice', data=df)
    plt.title(f'{var} vs SalePrice')

# Plot categorical variables against sales price
for i, var in enumerate(categorical_vars, 1):
    plt.subplot(2, 3, len(continuous_vars) + i)
    sns.boxplot(x=var, y='SalePrice', data=df)
    plt.title(f'{var} vs SalePrice')

plt.tight_layout()
plt.show()

In [None]:
### PPS
* 1stFlrSF dominates the PPS ranking, but several other variables are also relevant.
* We will perform a deeper analysis of the following variables (PPS against SalePrice):
    * OverallQual     0.440962
    * KitchenQual     0.261966
    * YearBuilt       0.198485
    * GarageArea      0.187993
    * GarageYrBuilt   0.158649
    * YearRemodAdd    0.143284

### Pearson correlation
* OverallQual       (0.790982)
* GrLivArea         (0.708624)
* GarageArea        (0.623431)
* TotalBsmtSF       (0.613581)
* 1stFlrSF          (0.605852)
* KitchenQual       (-0.589189)

### Conclusion of first check for correlation
* We want to check whether one variable can represent two or more similar variables in the same category (e.g. area or quality variables), giving a more efficient feature set for the pipeline.
    * Category - Quality: We have two quality variables, OverallQual and KitchenQual, and we want to see whether they are strongly correlated with each other.







* 1stFlrSF - moderate PPS and a somewhat lower correlation; the Pearson and Spearman values are close.
* OverallQual - lower PPS but high correlation with sale price.
* BsmtExposure - low PPS, and not among the top six for either Pearson or Spearman.
* YearBuilt - low PPS; absent from the Pearson top six but mid-ranked for Spearman.
* KitchenQual - low PPS and a negative Pearson correlation; absent from the Spearman top six.
* BsmtFinSF1 - lowest PPS; absent from both the Pearson and Spearman top six.


* BsmtExposure - Only PPS
* YearBuilt - PPS and Spearman
* KitchenQual - PPS and Pearson
* BsmtFinSF1 - Only PPS

* GrLivArea - Pearson and Spearman
* GarageArea - Pearson and Spearman
* TotalBsmtSF - Pearson and Spearman


### Spearman correlation
* OverallQual     0.809829
* GrLivArea       0.731310
* YearBuilt       0.652682
* GarageArea      0.649379
* TotalBsmtSF     0.602725
* 1stFlrSF        0.575408

<div style="display: flex;">
    <div style="margin-right: 20px;">
        <h2>PPS</h2>
        <table>
            <tr><th>Variable</th><th>Score</th></tr>
            <tr><td>OverallQual</td><td>0.441</td></tr>
            <tr><td>KitchenQual</td><td>0.262</td></tr>
            <tr><td>YearBuilt</td><td>0.198</td></tr>
            <tr><td>GarageArea</td><td>0.188</td></tr>
            <tr><td>GarageYrBuilt</td><td>0.159</td></tr>
            <tr><td>YearRemodAdd</td><td>0.143</td></tr>
        </table>
    </div>
    <div style="margin-right: 20px;">
        <h2>Pearson Correlation</h2>
        <table>
            <tr><th>Variable</th><th>Correlation</th></tr>
            <tr><td>OverallQual</td><td>0.791</td></tr>
            <tr><td>GrLivArea</td><td>0.709</td></tr>
            <tr><td>GarageArea</td><td>0.623</td></tr>
            <tr><td>TotalBsmtSF</td><td>0.614</td></tr>
            <tr><td>1stFlrSF</td><td>0.606</td></tr>
            <tr><td>KitchenQual</td><td>-0.589</td></tr>
        </table>
    </div>
    <div>
        <h2>Spearman Correlation</h2>
        <table>
            <tr><th>Variable</th><th>Correlation</th></tr>
            <tr><td>OverallQual</td><td>0.810</td></tr>
            <tr><td>GrLivArea</td><td>0.731</td></tr>
            <tr><td>YearBuilt</td><td>0.653</td></tr>
            <tr><td>GarageArea</td><td>0.649</td></tr>
            <tr><td>TotalBsmtSF</td><td>0.603</td></tr>
            <tr><td>1stFlrSF</td><td>0.575</td></tr>
        </table>
    </div>
</div>

* The most significant variables considering both Pearson and Spearman are: OverallQual, GrLivArea, GarageArea, TotalBsmtSF, 1stFlrSF and YearBuilt. The sixth Pearson variable was KitchenQual, but its coefficient is lower, so YearBuilt was chosen as the sixth variable.
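
The overlap between the two top-six lists can be checked with plain set operations. The sketch below simply re-enters the coefficients quoted in the Pearson and Spearman sections of this notebook; the variable names `pearson_top` and `spearman_top` are made up for this illustration.

```python
# Top-six lists as quoted in the Pearson and Spearman sections of this notebook
pearson_top = {
    "OverallQual": 0.791, "GrLivArea": 0.709, "GarageArea": 0.623,
    "TotalBsmtSF": 0.614, "1stFlrSF": 0.606, "KitchenQual": -0.589,
}
spearman_top = {
    "OverallQual": 0.810, "GrLivArea": 0.731, "YearBuilt": 0.653,
    "GarageArea": 0.649, "TotalBsmtSF": 0.603, "1stFlrSF": 0.575,
}

# Variables present in both lists, and those unique to each list
in_both = sorted(set(pearson_top) & set(spearman_top))
only_pearson = sorted(set(pearson_top) - set(spearman_top))
only_spearman = sorted(set(spearman_top) - set(pearson_top))

print("In both top-six lists:", in_both)
print("Only in Pearson top six:", only_pearson)
print("Only in Spearman top six:", only_spearman)
```

Five variables appear in both lists; KitchenQual appears only in the Pearson list and YearBuilt only in the Spearman list.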

## Compare Pearson and Spearman
* In this section we check whether there are any significant differences between Pearson and Spearman for the most significant variables.

## Summary comparison of Pearson, Spearman and Predictive Power Score
* Comparing the Predictive Power Score (PPS) rank of each variable with its Pearson correlation (PC) rank shows the following differences:
    * 1stFlrSF: PPS (1) - PC (2): PPS ranks higher than PC.
    * OverallQual: PPS (2) - PC (1): PPS ranks lower than PC.
    * BsmtExposure: PPS (3) - PC (6): PPS ranks higher than PC.
    * YearBuilt: PPS (4) - PC (4): PPS and PC rank the same.
    * KitchenQual: PPS (5) - PC (3): PPS ranks lower than PC.
    * BsmtFinSF1: PPS (6) - PC (5): PPS ranks lower than PC.
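
The rank comparison above can be reproduced with pandas. This is a minimal sketch: the PPS ranks are copied from the list above, the Pearson coefficients are the rounded values quoted elsewhere in this notebook, and the names `pps_rank`, `pearson` and `rank_table` are made up for the illustration.

```python
import pandas as pd

# PPS ranks as listed above; Pearson coefficients as quoted in this notebook
pps_rank = pd.Series({
    "1stFlrSF": 1, "OverallQual": 2, "BsmtExposure": 3,
    "YearBuilt": 4, "KitchenQual": 5, "BsmtFinSF1": 6,
})
pearson = pd.Series({
    "OverallQual": 0.79, "1stFlrSF": 0.61, "KitchenQual": -0.59,
    "YearBuilt": 0.52, "BsmtFinSF1": 0.39, "BsmtExposure": -0.31,
})

# Rank by absolute value so strong negative correlations are not pushed to the bottom
pearson_rank = pearson.abs().rank(ascending=False).astype(int)

rank_table = pd.DataFrame({"pps_rank": pps_rank, "pearson_rank": pearson_rank})
# Positive shift: the variable ranks higher under PPS than under Pearson
rank_table["shift"] = rank_table["pearson_rank"] - rank_table["pps_rank"]
print(rank_table.sort_values("pps_rank"))
```

A shift of zero (YearBuilt) means both measures agree on the rank; the largest shift belongs to BsmtExposure.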

* The differences between the Pearson and Spearman correlation coefficients, ordered from largest to smallest, were:
    * YearBuilt     (0.13)
    * BsmtFinSF1    (0.09)
    * 1stFlrSF      (0.03)
    * KitchenQual   (0.02)
    * OverallQual   (0.02)
    * BsmtExposure  (0.01)  
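
One way to compute such differences is sketched below. The helper `pearson_spearman_gap` is hypothetical and the toy data is made up; it only demonstrates that a monotonic but non-linear feature produces a larger gap than a linear one, which is the pattern the list above is looking for.

```python
import numpy as np
import pandas as pd


def pearson_spearman_gap(df, target):
    """Absolute Pearson-Spearman difference per feature, largest first (hypothetical helper)."""
    pearson = df.corr(method="pearson")[target].drop(target)
    spearman = df.corr(method="spearman")[target].drop(target)
    return (pearson - spearman).abs().sort_values(ascending=False)


# Made-up toy data: one linear and one monotonic but non-linear feature
x = np.arange(1, 21, dtype=float)
toy = pd.DataFrame({
    "linear_feature": x,
    "curved_feature": np.log(x),
    "target": 3.0 * x + 1.0,
})
print(pearson_spearman_gap(toy, "target"))
```

On the real data, the same helper could be called with the notebook's DataFrame and `'SalePrice'` as the target, assuming all columns are numeric at that point.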
    

## Grade variables according to significance
* We want to know the importance of each variable; values close to either -1 or 1 are the most significant. The list is sorted by Pearson coefficient; the discrepancy between Pearson and Spearman is analysed further in upcoming sections.
1. OverallQual  (Pn.  0.79)
2. 1stFlrSF     (Pn.  0.61)
3. KitchenQual  (Pn. -0.59)
4. YearBuilt    (Pn.  0.52)
5. BsmtFinSF1   (Pn.  0.39)
6. BsmtExposure (Pn. -0.31)

* OverallQual and KitchenQual
    * The quality variables show similarities between the Spearman curves for OverallQual and KitchenQual. For KitchenQual, a poor kitchen quality keeps the price down, resulting in a relatively flat price level between 1.0 and 3.0. However, beyond this point, the price increases significantly with higher quality levels.

    * The corresponding curve for OverallQual is closer to the Pearson trendline and indicates a more linear relationship between OverallQual and price. However, the Spearman trendline for OverallQual also has a flat appearance for quality levels 1 through 7, after which the price increases more rapidly with each increment in quality, similar to the KitchenQual variable.

    * Given these similarities, we believe it is possible to merge the Spearman curves and create a single quality variable for predicting sale price. This approach will be further analyzed in the next notebook, focusing on feature engineering.
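
One possible shape for such a single quality variable is sketched below. This is only an assumption about how the merge could be done, not the method chosen in the feature engineering notebook: both variables are rescaled to 0-1 and averaged. The mapping is the one used for KitchenQual in the backup cells; the function name, the `QualityScore` column and the assumed 1-10 range of OverallQual are illustrative.

```python
import pandas as pd

# Ordinal mapping for KitchenQual, as used in the backup cells of this notebook
kitchen_qual_mapping = {"Ex": 4, "Gd": 3, "TA": 2, "Fa": 1}


def combined_quality(df):
    """Average of OverallQual and KitchenQual after rescaling both to 0-1 (illustrative only)."""
    kitchen = df["KitchenQual"].map(kitchen_qual_mapping)
    kitchen_scaled = (kitchen - 1) / (4 - 1)              # KitchenQual codes run 1-4
    overall_scaled = (df["OverallQual"] - 1) / (10 - 1)   # OverallQual assumed to run 1-10
    return (kitchen_scaled + overall_scaled) / 2


# Made-up rows for demonstration
toy = pd.DataFrame({
    "OverallQual": [1, 5, 10],
    "KitchenQual": ["Fa", "TA", "Ex"],
})
toy["QualityScore"] = combined_quality(toy)
print(toy)
```

An equal-weight average is the simplest choice; whether it predicts sale price better than either variable alone would need to be checked against the data.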

### Garage initial assessment
Upon initial examination, there is no strong indication that the GarageFinish feature has a significant correlation with the sale price. Additionally, the initial investigation into the relationship between garage area and sale price reveals a relatively low correlation.

Given that the size of the house typically exhibits a strong correlation with sale price, the comparatively weaker correlation observed with garage area suggests that other factors may have a more pronounced influence on the final sale price.

Further analysis is warranted to understand the nuanced relationship between garage attributes and sale price. This may include exploring potential outliers, considering interactions with other features, and employing more advanced analytical techniques to capture non-linear relationships effectively.

By delving deeper into these factors, we can gain a more comprehensive understanding of the garage's impact on property valuation and make more informed decisions regarding its significance in the overall pricing model.

In our analysis of basement exposure, we observed a noteworthy trend: properties with missing values for basement exposure tend to have lower sale prices. This suggests that the absence of basement exposure data may signal certain property characteristics that contribute to decreased market value.

Moreover, our examination revealed a subtle but discernible impact on sale price attributed to good living quarters within the basement. Properties featuring well-finished living spaces below ground level exhibited a slight positive influence on sale price, indicating a preference among buyers for quality basement amenities.

These findings underscore the importance of considering basement attributes in property valuation, as they can significantly influence market perceptions and ultimately affect sale prices. Further exploration into the nuances of basement features and their impact on property value is warranted to provide deeper insights for real estate decision-making.

Since the sale price range for properties with missing values is lower and closest to that of the "No" category, we will impute the missing values with "No" for future calculations.
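
The reasoning behind this imputation can be sketched as follows. The data is made up and only the column names follow the notebook; the idea is to label the missing values temporarily, compare the median sale price per category, and impute with the category whose median is closest to that of the missing group.

```python
import pandas as pd

# Made-up toy data; only the column names follow the notebook
toy = pd.DataFrame({
    "BsmtExposure": ["Gd", "Av", "No", None, "No", None, "Gd", "Mn"],
    "SalePrice": [300000, 220000, 160000, 150000, 170000, 140000, 310000, 200000],
})

# Label missing values temporarily so they show up as their own group
labelled = toy["BsmtExposure"].fillna("Missing")
median_price = toy.groupby(labelled)["SalePrice"].median()
print(median_price.sort_values())

# Find the existing category whose median price is closest to the "Missing" group
distance = (median_price.drop("Missing") - median_price["Missing"]).abs()
closest_category = distance.idxmin()

# Impute with that category ("No" in this toy data)
toy["BsmtExposure"] = toy["BsmtExposure"].fillna(closest_category)
print(toy["BsmtExposure"].value_counts())
```

The median is used here rather than the mean so that a few expensive houses do not distort the comparison; that is a design choice of this sketch, not something stated in the analysis above.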

There is no clear connection between BsmtFinType and sales price. Since low-quality and average recreation rooms are essentially the same, the finish type has little effect. What can be discerned is that the quality of living quarters is influencing the price; however, this is most likely due to location rather than the finish type.

The sales price tends to increase with newer garages; however, this trend is likely influenced by property characteristics. Outliers were detected between 1993 and 1996, but subsequently, sales prices reverted to lower levels.

From this investigation we can conclude that the following factors are most relevant: Quality, Space (Area) and Age (YearBuilt). This addresses Business Requirement 1: "The client is interested in discovering how the house attributes correlate with the sale price". We now know that these factors and their associated variables have the strongest correlation with the sale price.

In [None]:
unique_kitchen_qualities = df['KitchenQual'].unique()
print("Unique values in 'KitchenQual' column:", unique_kitchen_qualities)

---

# Section 2

Section 2 content

---

NOTE

* You may add as many sections as you want, as long as they support your project workflow.
* All notebook's cells should be run top-down (you can't create a dynamic wherein a given point you need to go back to a previous cell to execute some task, like go back to a previous cell and refresh a variable content)

---

# Push files to Repo

* If you do not need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
  # create here your folder
  # os.makedirs(name='')
  pass  # placeholder so the otherwise empty try block is valid Python
except Exception as e:
  print(e)


