# **Sales price study**

## Objectives

- Answer business requirement 1:
    - The client seeks to understand how various attributes of their houses influence the typical sale price.

- Data Visualization.

- Data Exploration.

- Conclusion.

## Inputs

* inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv 

## Outputs

* Data that answers business requirements.

## Additional Comments

* This file and its contents were inspired by and adapted from the Churnometer Walkthrough Project 2. 


---

# Change working directory

* The notebooks are assumed to be stored in a subfolder, so when running a notebook in the editor you need to change the working directory.

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
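As a small illustration of the step above, the sketch below shows what os.path.dirname() returns for a path. The folder names are hypothetical and chosen only for the example; the path does not need to exist, and nothing here changes the actual working directory.

```python
import os

# Hypothetical path, used only for illustration; it does not need to exist on disk.
notebook_dir = os.path.join("workspace", "project", "jupyter_notebooks")

# os.path.dirname() strips the last path component and returns the parent folder.
parent_dir = os.path.dirname(notebook_dir)
print(parent_dir)
```

Because os.getcwd() returns a path without a trailing separator, passing its result to os.path.dirname() gives the parent folder, as in this example.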

# Load Data

In [None]:
import pandas as pd
df = pd.read_csv("outputs/datasets/collection/house-prices.csv")
print(df.shape)
df.head()

---

# Data Exploration

In [None]:
from ydata_profiling import ProfileReport
pd_report = ProfileReport(df=df, minimal=True)
pd_report.to_notebook_iframe()

Dropping Columns with Missing Data:

In [None]:
df = df.drop(columns=['EnclosedPorch', 'GarageFinish', 'WoodDeckSF', 'LotFrontage'])

This cell removes four columns (EnclosedPorch, GarageFinish, WoodDeckSF, LotFrontage) from the dataframe because they contain a significant amount of missing data. Dropping them lets the analysis proceed with features that are more complete and reliable.

Removing Rows with Missing Values:

In [None]:
df = df.dropna()
df.index

This cell removes every remaining row that contains a missing value using .dropna(), so the dataset used in the rest of the study has no incomplete records. This keeps the analysis and any later machine learning tasks consistent. Displaying df.index shows which rows were kept.
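One way to check how much data is missing before deciding what to drop is sketched below. The dataframe, its column names, and the 50% threshold are all invented for illustration; they are assumptions, not values taken from this study.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only; not taken from the house price dataset.
toy = pd.DataFrame({
    "A": [1.0, np.nan, np.nan, np.nan],  # 75% missing
    "B": [1.0, 2.0, np.nan, 4.0],        # 25% missing
    "C": [1.0, 2.0, 3.0, 4.0],           # complete
})

# Share of missing values per column, highest first.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Hypothetical threshold: drop columns with more than half of their values missing.
threshold = 0.5
cols_to_drop = missing_share[missing_share > threshold].index.to_list()

# Drop the sparse columns first, then drop any rows that still contain missing values.
cleaned = toy.drop(columns=cols_to_drop).dropna()
print(cols_to_drop, cleaned.shape)
```

Dropping sparse columns before calling .dropna() matters: if the rows were dropped first, a single mostly-empty column could remove most of the dataset.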

# Correlation Study

The cell below applies one-hot encoding to all categorical columns in the dataframe (df), converting each category into a binary numeric column. Categorical features need a numeric representation before correlations can be computed or machine learning models can process them. The OneHotEncoder produces a new dataframe (df_ohe) that contains the original numerical features along with the encoded categorical features.

In [None]:
from feature_engine.encoding import OneHotEncoder
encoder = OneHotEncoder(variables=df.columns[df.dtypes=='object'].to_list(), drop_last=False)
df_ohe = encoder.fit_transform(df)
print(df_ohe.shape)
df_ohe.head(3)
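To make the effect of one-hot encoding concrete, here is a minimal sketch on invented data. It uses pandas.get_dummies as a stand-in for feature_engine's OneHotEncoder, so it is an analogy rather than the exact transformation applied above; the values are made up for the example.

```python
import pandas as pd

# Toy data, invented for illustration; the labels mirror the KitchenQual codes.
toy = pd.DataFrame({
    "KitchenQual": ["Gd", "TA", "Ex", "TA"],
    "GrLivArea": [1710, 1262, 2198, 1500],
})

# pandas.get_dummies is used here as a stand-in for feature_engine's OneHotEncoder.
# Each category becomes its own 0/1 column named <variable>_<category>.
toy_ohe = pd.get_dummies(toy, columns=["KitchenQual"], dtype=int)
print(toy_ohe)
```

Each house has a 1 in exactly one of the KitchenQual columns. This is why a single category of a variable can appear as its own entry in a correlation ranking.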

We use .corr() with the spearman and pearson methods and investigate the top 10 correlations with SalePrice
- This command returns a pandas series whose first item is the correlation of SalePrice with itself, so we exclude it with [1:]
- We sort values by their absolute value, by setting key=abs, so that strong negative correlations are ranked alongside strong positive ones

Spearman:

In [None]:
corr_spearman = df_ohe.corr(method='spearman')['SalePrice'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_spearman

Pearson:

In [None]:
corr_pearson = df_ohe.corr(method='pearson')['SalePrice'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_pearson
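The ranking logic used in the two cells above can be sketched on a small invented dataframe. The column names Size, Age, and Noise and all values are hypothetical, chosen so that one feature has a strong negative correlation with the target.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "SalePrice": [100, 200, 300, 400, 500],
    "Size": [10, 22, 29, 41, 50],   # rises with price
    "Age": [50, 35, 30, 12, 5],     # falls as price rises
    "Noise": [3, 1, 4, 1, 5],       # weak relationship
})

corr = toy.corr(method="pearson")["SalePrice"]

# key=abs ranks by strength regardless of sign, so a strong negative
# correlation is not pushed to the bottom of the list.
ranked = corr.sort_values(key=abs, ascending=False)

# Position 0 is SalePrice against itself, so [1:] skips it.
top = ranked[1:].head(2)
print(top)
```

Without key=abs, Age would be sorted last despite being one of the two strongest relationships in this toy example.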

The correlation analysis shows that Overall Quality, Above Ground Living Area (GrLivArea), and Garage Area have the strongest positive correlations with Sale Price, indicating these factors are highly influential in determining house value. Kitchen Quality also plays a significant role, where higher quality kitchens (e.g., Good or Excellent) positively correlate with price, while a Typical/Average kitchen correlates negatively. The Year Built and Year Remodeled also have moderate positive correlations, suggesting newer or renovated houses tend to sell for more, although their influence is less pronounced than other factors like quality and living area.

Both Spearman and Pearson correlations generally support these conclusions.
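Running both methods is useful because they measure different things: Pearson captures linear relationships, while Spearman captures monotonic ones by working on ranks. The sketch below, on invented data, shows a case where the two differ.

```python
import pandas as pd

# Toy data, invented for illustration: y grows monotonically but not linearly with x.
toy = pd.DataFrame({"x": range(1, 11)})
toy["y"] = toy["x"] ** 3

pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]
print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```

When the two methods rank the same variables highly, as reported in this study, it suggests the relationships are not merely an artifact of outliers or of a strongly non-linear shape.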

Based on the correlation analysis, the five most interesting variables to study further, considering their influence on the Sale Price, are:

In [None]:
vars_to_study = ['OverallQual', 'GrLivArea', 'GarageArea', 'KitchenQual', 'YearBuilt']
vars_to_study

These five variables are important because they each represent different aspects of what influences a buyer's decision: the quality of the finishes (OverallQual and KitchenQual), practical features (GrLivArea and GarageArea), and the appeal of a newer construction (YearBuilt). Studying these can provide a comprehensive understanding of what drives property prices in this dataset.

# EDA on selected variables

In [None]:
df_eda = df.filter(vars_to_study + ['SalePrice'])
print(df_eda.shape)
df_eda.head()

### Visualization of selected variables:

In [None]:
# Add this line to ensure plots are shown in Jupyter Notebook
%matplotlib inline

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load dataset
file_path = 'outputs/datasets/collection/house-prices.csv'  # Update the path if needed
df = pd.read_csv(file_path)

# Select variables of interest (copy to avoid SettingWithCopyWarning on the assignment below)
variables_to_study = ['OverallQual', 'GrLivArea', 'GarageArea', 'KitchenQual', 'YearBuilt']
df = df[variables_to_study + ['SalePrice']].copy()

# Map categorical variable 'KitchenQual' to numerical values for visualization purposes
# (any label missing from this mapping becomes NaN and is not plotted)
df['KitchenQual'] = df['KitchenQual'].map({'Ex': 3, 'Gd': 2, 'TA': 1, 'Fa': 0})

# Plotting the correlations with SalePrice
for var in variables_to_study:
    plt.figure(figsize=(8, 6))
    sns.scatterplot(data=df, x=var, y='SalePrice')
    plt.title(f'Sale Price vs {var}', fontsize=16)
    plt.xlabel(var, fontsize=14)
    plt.ylabel('Sale Price', fontsize=14)
    plt.grid(True)
    plt.show()
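As a complement to mapping KitchenQual to integers by hand, an ordered categorical keeps the original labels while still encoding their order. The sketch below uses invented prices and assumes the quality order Fa < TA < Gd < Ex implied by the mapping above. It summarises price by quality level rather than plotting it.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "KitchenQual": ["TA", "Gd", "Ex", "Fa", "Gd", "TA"],
    "SalePrice": [140000, 210000, 320000, 95000, 230000, 160000],
})

# Declare the quality order explicitly instead of hand-mapping labels to integers.
quality_order = ["Fa", "TA", "Gd", "Ex"]
toy["KitchenQual"] = pd.Categorical(
    toy["KitchenQual"], categories=quality_order, ordered=True
)

# Median price per quality level, listed from lowest to highest quality.
median_by_quality = toy.groupby("KitchenQual", observed=True)["SalePrice"].median()
print(median_by_quality)
```

A per-level summary like this can be easier to read than a scatter plot when the x-axis has only a handful of distinct values.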



# Conclusions 

- **Overall Quality (`OverallQual`) Shows Strong Correlation with Sale Price:**
  - The feature `OverallQual` exhibits a strong positive correlation with `SalePrice`: houses rated higher for overall quality tend to sell for noticeably more.

- **Above Ground Living Area (`GrLivArea`) Is Also a Strong Predictor:**
  - The `GrLivArea` (Above Ground Living Area) is positively correlated with `SalePrice`, indicating that larger living spaces tend to lead to higher house prices.

- **Garage Area (`GarageArea`) and Sale Price Are Positively Related:**
  - The scatter plot shows a positive correlation between `GarageArea` and `SalePrice`, meaning houses with larger garage spaces tend to have higher sale values.

- **Kitchen Quality (`KitchenQual`) Influences Sale Price:**
  - `KitchenQual` (mapped numerically) shows that higher kitchen quality (`Excellent` or `Good`) is associated with a higher `SalePrice`. Buyers likely value high-quality kitchens, making them an important factor in pricing.

- **Newer Houses (`YearBuilt`) Tend to Have Higher Sale Prices:**
  - The feature `YearBuilt` shows that newer houses generally have higher sale prices, possibly because they are perceived to be more modern and to require fewer renovations.