# Sale Price Study Notebook


## Objectives

- Answer business requirement 1

## Inputs

- The data loaded from the Data Collection Notebook: outputs/datasets/collection/house_prices_records.csv

## Outputs

- Correlation plots that answer business requirement 1.


---

# Change working directory

* We are assuming you will store the notebooks in a subfolder, therefore when running the notebook in the editor, you will need to change the working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory:
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

# Load Data

First, we will load the house prices dataset previously gathered and stored in the repository during the Data Collection phase. 

We will then display the first few rows of the dataset to confirm its structure and contents:

In [None]:
import pandas as pd

# Load the house prices dataset
df = pd.read_csv("outputs/datasets/collection/house_prices_records.csv")

# Display the first five rows of the dataframe to inspect the data
df.head()

---

# Data Exploration

Next, we will use ydata-profiling (formerly Pandas Profiling) to explore the dataframe in detail.

This enables us to review and analyze the dataset’s structure, highlighting key aspects such as missing values, variable distributions, and potential anomalies.

In [None]:
from ydata_profiling import ProfileReport

# Generate a minimal report of the df DataFrame
pandas_report = ProfileReport(df=df, minimal=True)

# Display the report within the Jupyter Notebook
pandas_report.to_notebook_iframe()

**Dataset Overview**: Our review of the dataset provides the following:
- The dataset includes **24 columns** and **1460 rows**.
- Of these columns, **20 contain numerical data** while the remaining **4 are text variables** used as categorical identifiers.
- Importantly, the **"OverallCond"** and **"OverallQual"** variables use numerical values to rate the condition and quality of properties. <br> A similar approach may be helpful for the text-based categorical variables as well.
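
As a rough illustration of the numerical-scale idea for text categories, the sketch below maps ordered labels to integers. The column name `quality_label`, its labels, and the chosen ordering are hypothetical and are not taken from the house prices dataset.

```python
import pandas as pd

# Hypothetical text-based category with a natural order (not a real dataset column)
toy = pd.DataFrame({"quality_label": ["Good", "Fair", "Excellent", "Poor"]})

# Assumed ordering: a higher number means better quality
quality_order = {"Poor": 1, "Fair": 2, "Good": 3, "Excellent": 4}

# Map each label to its position on the numerical scale
toy["quality_score"] = toy["quality_label"].map(quality_order)
print(toy)
```

A mapping like this only makes sense when the labels have a genuine order; unordered categories would need a different encoding.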

Additionally, approximately 10% of the data is missing across various columns, a concern that will need to be addressed.
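
One way to see where the missing values sit is to compute the share of missing entries per column. The sketch below uses a small made-up dataframe (the column names and values are assumptions for illustration); the same two expressions could be applied to `df`.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({
    "area": [8450, 9600, np.nan, 9550],
    "finish": ["RFn", None, None, "Unf"],
    "price": [208500, 181500, 223500, 140000],
})

# Percentage of missing values per column, largest first, hiding complete columns
missing_pct = toy.isna().mean().mul(100).sort_values(ascending=False)
missing_pct = missing_pct[missing_pct > 0]

# Percentage of missing cells across the whole dataframe
overall_pct = toy.isna().to_numpy().mean() * 100

print(missing_pct)
print(f"Overall missing: {overall_pct:.1f}%")
```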

---

# Correlation Study

To understand the interactions among the dataset's features, we begin with an assessment of their relationships, focusing particularly on how they correlate with the Sale Price. 

Initially, we will employ both Pearson and Spearman correlation analyses to identify strong relationships within the data.

### Quick Explanation

**Pearson’s Correlation Coefficient**: Pearson’s correlation measures the linear relationship between two continuous variables. It quantifies the degree to which a pair of variables are related, providing a value between -1 and 1:
- 1 indicates a perfect positive linear relationship, meaning that as one variable increases, the other variable also increases.
- -1 indicates a perfect negative linear relationship, meaning that as one variable increases, the other variable decreases.
- 0 indicates no linear correlation, suggesting that there is no linear dependence between the variables.

**Spearman’s Correlation Coefficient**: Spearman’s correlation measures the monotonic relationship between two variables, whether linear or not. It is particularly useful when the data is not normally distributed and, like Pearson’s, it provides a value between -1 and 1:
- 1 indicates a perfect positive monotonic relationship, where increasing values in one variable consistently correspond with increasing values in the other.
- -1 indicates a perfect negative monotonic relationship, where increasing values in one variable consistently correspond with decreasing values in the other.
- 0 indicates no monotonic correlation, suggesting that there is no consistent relationship between the rankings of the variables.
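
The difference between the two coefficients can be seen on a small synthetic example. In the sketch below, `y` grows exponentially with `x`, so the relationship is perfectly monotonic but not linear; the data is made up purely for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data: y always increases with x, but not along a straight line
toy = pd.DataFrame({"x": np.arange(1, 11)})
toy["y"] = np.exp(toy["x"])

pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

print(f"Pearson:  {pearson:.3f}")   # noticeably below 1: the relationship is not linear
print(f"Spearman: {spearman:.3f}")  # 1.000: the ranks of x and y agree perfectly
```

A gap like this between the two coefficients is a hint that a relationship is consistent in direction but curved.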

## Pearson’s Correlation Coefficient

In [None]:
# Calculate the Pearson correlation coefficients between "SalePrice" and the other numeric variables
# (numeric_only=True, available in pandas 1.5+, skips the text columns)
# Sort by absolute value in descending order to highlight the strongest relationships
# Exclude the first result, which is the correlation of SalePrice with itself and is always 1
corr_pearson = df.corr(method='pearson', numeric_only=True)['SalePrice'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_pearson

## Spearman’s Correlation Coefficient

In [None]:
# Calculate the Spearman correlation coefficients between "SalePrice" and the other numeric variables
# (numeric_only=True, available in pandas 1.5+, skips the text columns)
# Sort by absolute value in descending order to highlight the strongest relationships
# Exclude the first result, which is the correlation of SalePrice with itself and is always 1
corr_spearman = df.corr(method='spearman', numeric_only=True)['SalePrice'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_spearman

### Initial Insights

Based on the correlation analysis, where a coefficient of 0.70 or above indicates a strong correlation and values between 0.50 and 0.69 suggest a moderate correlation, several key variables are closely associated with the sale price. Notably:

- **OverallQual**: 
    - Spearman: 0.809829
    - Pearson: 0.790982
    - **Analysis**: This variable shows the strongest correlation in both analyses. It is highly predictive of sale price, suggesting that improvements in quality can significantly impact the price.

- **GrLivArea** (Above ground living area square feet):
    - Spearman: 0.731310
    - Pearson: 0.708624
    - **Analysis**: This variable consistently shows a strong positive correlation, indicating that larger living areas tend to correspond to higher sale prices.

- **GarageArea**:
    - Spearman: 0.649379
    - Pearson: 0.623431
    - **Analysis**: The area of the garage in square feet shows a moderate positive correlation with sale prices.

- **TotalBsmtSF** (Total square feet of basement area):
    - Spearman: 0.602725
    - Pearson: 0.613581
    - **Analysis**: A moderate correlate, indicating that larger basements tend to correspond to higher home prices.

- **YearBuilt**: 
    - Spearman: 0.652682
    - Pearson: 0.522897
    - **Analysis**: Generally, newer homes tend to have higher sale prices, though the relationship is stronger in the Spearman correlation, suggesting that the relationship might not be strictly linear but is consistently positive.

### Supporting Analysis:

- **YearRemodAdd** (Year of Remodel/Addition):
    - Both correlations show that more recently remodeled homes, or homes with more recent additions, tend to sell for more. <br> This is speculation, but since overall quality has the strongest relationship with sale price, remodeling may raise value because higher-quality materials are used.

- **1stFlrSF** (First Floor square feet):
    - Consistent strong correlation in both analyses.
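
The thresholds stated above can be applied mechanically to the reported coefficients. The sketch below is a minimal illustration using the Pearson values listed in this section; the helper name `classify_strength` and the label for values under 0.50 are assumptions, since the text only defines the strong and moderate bands.

```python
def classify_strength(r):
    """Label a correlation coefficient using the thresholds from this section."""
    magnitude = abs(r)
    if magnitude >= 0.70:
        return "strong"
    if magnitude >= 0.50:
        return "moderate"
    return "below moderate"  # assumed label; the text does not name this band


# Pearson coefficients as reported above
pearson_values = {
    "OverallQual": 0.790982,
    "GrLivArea": 0.708624,
    "GarageArea": 0.623431,
    "TotalBsmtSF": 0.613581,
    "YearBuilt": 0.522897,
}

labels = {name: classify_strength(r) for name, r in pearson_values.items()}
for name, label in labels.items():
    print(f"{name}: {label}")
```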

---

### In-Depth Correlation Analysis

We will now conduct a detailed examination of the relationships among the variables identified as having significant influence on the sale price.

This extended analysis involves generating visual representations through heatmaps to better understand both the linear and nonlinear relationships present in the data.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import ppscore as pps

def heatmap_corr(df,threshold, figsize=(20,12), font_annot = 8):
  if len(df.columns) > 1:
    mask = np.zeros_like(df, dtype=bool)  # np.bool was removed in NumPy 1.24
    mask[np.triu_indices_from(mask)] = True
    mask[abs(df) < threshold] = True

    fig, axes = plt.subplots(figsize=figsize)
    sns.heatmap(df, annot=True, xticklabels=True, yticklabels=True,
                mask=mask, cmap='viridis', annot_kws={"size": font_annot}, ax=axes,
                linewidth=0.5
                     )
    axes.set_yticklabels(df.columns, rotation = 0)
    plt.ylim(len(df.columns),0)
    plt.show()


def heatmap_pps(df,threshold, figsize=(20,12), font_annot = 8):
    if len(df.columns) > 1:

      mask = np.zeros_like(df, dtype=bool)  # np.bool was removed in NumPy 1.24
      mask[abs(df) < threshold] = True

      fig, ax = plt.subplots(figsize=figsize)
      ax = sns.heatmap(df, annot=True, xticklabels=True,yticklabels=True,
                       mask=mask,cmap='rocket_r', annot_kws={"size": font_annot},
                       linewidth=0.05,linecolor='grey')
      
      plt.ylim(len(df.columns),0)
      plt.show()



def CalculateCorrAndPPS(df):
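  # numeric_only=True (pandas 1.5+) skips the text columns, which cannot be correlated directly
  df_corr_spearman = df.corr(method="spearman", numeric_only=True)
  df_corr_pearson = df.corr(method="pearson", numeric_only=True)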

  pps_matrix_raw = pps.matrix(df)
  pps_matrix = pps_matrix_raw.filter(['x', 'y', 'ppscore']).pivot(columns='x', index='y', values='ppscore')

  pps_score_stats = pps_matrix_raw.query("ppscore < 1").filter(['ppscore']).describe().T
  print("PPS threshold - check PPS score IQR to decide the threshold for the heatmap \n")
  print(pps_score_stats.round(3))

  return df_corr_pearson, df_corr_spearman, pps_matrix


def DisplayCorrAndPPS(df_corr_pearson, df_corr_spearman, pps_matrix,CorrThreshold,PPS_Threshold,
                      figsize=(20,12), font_annot=8 ):

  print("\n")
  print("* Analyze how the target variable for your ML models is correlated with other variables (features and target)")
  print("* Analyze multicollinearity, that is, how the features are correlated among themselves")

  print("\n")
  print("*** Heatmap: Spearman Correlation ***")
  print("It evaluates monotonic relationship \n")
  heatmap_corr(df=df_corr_spearman, threshold=CorrThreshold, figsize=figsize, font_annot=font_annot)

  print("\n")
  print("*** Heatmap: Pearson Correlation ***")
  print("It evaluates the linear relationship between two continuous variables \n")
  heatmap_corr(df=df_corr_pearson, threshold=CorrThreshold, figsize=figsize, font_annot=font_annot)

  print("\n")
  print("*** Heatmap: Predictive power Score (PPS) ***")
  print(f"PPS detects linear or non-linear relationships between two columns.\n"
        f"The score ranges from 0 (no predictive power) to 1 (perfect predictive power) \n")
  heatmap_pps(df=pps_matrix,threshold=PPS_Threshold, figsize=figsize, font_annot=font_annot)

In [None]:
df_corr_pearson, df_corr_spearman, pps_matrix = CalculateCorrAndPPS(df)

In [None]:
DisplayCorrAndPPS(df_corr_pearson=df_corr_pearson,
                  df_corr_spearman=df_corr_spearman, 
                  pps_matrix=pps_matrix,
                  CorrThreshold=0.6, PPS_Threshold=0.15,
                  figsize=(10,10), font_annot=8)
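
To make the effect of the correlation threshold concrete, the sketch below rebuilds the masking idea from `heatmap_corr` on a tiny made-up correlation matrix: the upper triangle (including the diagonal) is hidden, along with every coefficient whose absolute value is below the threshold. The matrix values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up symmetric correlation matrix
corr = pd.DataFrame(
    [[1.0, 0.8, 0.3],
     [0.8, 1.0, 0.65],
     [0.3, 0.65, 1.0]],
    index=["A", "B", "C"],
    columns=["A", "B", "C"],
)
threshold = 0.6

# True means "hide this cell" in seaborn's heatmap mask
mask = np.zeros(corr.shape, dtype=bool)
mask[np.triu_indices_from(mask)] = True                 # hide duplicates and the diagonal
mask[(corr.abs() < threshold).to_numpy()] = True        # hide weak correlations

# Cells that would still be drawn; hidden cells become NaN
visible = corr.mask(mask)
print(visible)
```

Only the pairs B/A (0.8) and C/B (0.65) remain visible here, which mirrors how a higher threshold leaves fewer annotated cells on the heatmap.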

In [None]:
variables = {
    'OverallQual': 'Rating 1-10',
    'GrLivArea': 'Square Feet',
    'GarageArea': 'Square Feet',
    'TotalBsmtSF': 'Square Feet',
    'YearBuilt': 'Year',
    'YearRemodAdd': 'Year',
    '1stFlrSF': 'Square Feet'
}

# Set the plot style once, before any figure is created, so every plot uses it
sns.set_style("whitegrid")

# Loop through each variable and its unit in the dictionary and create a regression plot
for variable, unit in variables.items():
    plt.figure(figsize=(10, 6))
    sns.regplot(x=variable, y='SalePrice', data=df, line_kws={"color": "red"}, ci=None)
    plt.title(f'Sale Price vs. {variable}')
    plt.xlabel(f'{variable} ({unit})')
    plt.ylabel('Sale Price ($)')
    plt.grid(True)
    plt.show()

# Push files to Repo

* If you do not need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
  # create here your folder
  # os.makedirs(name='')
  pass  # placeholder so the otherwise empty try block is valid Python
except Exception as e:
  print(e)
