# Sale Price Study 


## Inputs

- outputs/datasets/datacollection/HousePrices.csv


## Outputs

- Start answering Business Requirement 1 and generate graphs.


## Objectives

- The objective here is to start addressing the business requirements. My plan is to show how a house's attributes can influence its market value.

- Validate the hypotheses.


## CRISP-DM 

"Data Understanding"

In [None]:
import os
current_dir = os.getcwd()
current_dir

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

In [None]:
current_dir = os.getcwd()
current_dir

In [None]:
import pandas as pd
df = (pd.read_csv("outputs/datasets/datacollection/HousePrices.csv"))
df.head()

In [None]:
# We will now conduct an exploratory data analysis (EDA). This will give us a better insight into our DataFrame. 

import pandas_profiling
pandas_report = pandas_profiling.ProfileReport(df, minimal=True)
pandas_report.to_notebook_iframe()

# Correlation study

- First, we identify the variables with missing data. This will potentially be useful for the Data Cleaning step.

In [None]:
vars_with_missing_data = df[df.columns[df.isna().sum() > 0 ]]
vars_with_missing_data
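
To make the missing-data check easier to read, the affected columns can be summarised as counts and percentages. This is a minimal sketch on a hypothetical toy DataFrame (`toy` and the `Category` column are assumptions for illustration, not part of HousePrices.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the house prices data
toy = pd.DataFrame({
    "SalePrice": [200000, 150000, 320000, 180000],
    "2ndFlrSF": [850.0, np.nan, 0.0, np.nan],
    "Category": ["A", None, "B", "A"],  # hypothetical categorical column
})

# Count and percentage of missing values per column, keeping only affected columns
missing_summary = (
    pd.DataFrame({
        "missing_count": toy.isna().sum(),
        "missing_pct": (toy.isna().mean() * 100).round(1),
        "dtype": toy.dtypes.astype(str),
    })
    .query("missing_count > 0")
    .sort_values("missing_count", ascending=False)
)
print(missing_summary)
```

Sorting by `missing_count` puts the columns that need the most attention during data cleaning at the top.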

In [None]:
missing_var = vars_with_missing_data.select_dtypes(include="object").columns.tolist()

missing_var


In [10]:
from sklearn.impute import SimpleImputer

# Create an imputer object that fills missing values with the most frequent category
categorical_imputer = SimpleImputer(strategy="most_frequent")

# Apply the imputer to the variable for missing values 
df[missing_var] = categorical_imputer.fit_transform(df[missing_var])

In [None]:
df[missing_var].info()
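
As a small illustration of what most-frequent imputation does, the sketch below applies `SimpleImputer` to a hypothetical single-column frame (`toy` and `Finish` are made-up names, and the values are not from the dataset):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical categorical column with one missing entry
toy = pd.DataFrame({"Finish": pd.Series(["Unf", "Unf", np.nan, "Fin"], dtype=object)})

imputer = SimpleImputer(strategy="most_frequent")
toy[["Finish"]] = imputer.fit_transform(toy[["Finish"]])

print(imputer.statistics_)     # the fill value learned for each column
print(toy["Finish"].tolist())  # the missing entry now holds the most frequent category
```

Because the fill value is learned in `fit`, the same fitted imputer can later be applied to new data with `transform` so that both are filled consistently.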

In [None]:
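from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# Select only object (categorical) columns for encoding

categorical_cols = df.select_dtypes(include='object').columns
# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output` and 1.4 removed the old name;
# on 1.4 or later use OneHotEncoder(sparse_output=False, drop=None) instead.
encoder = OneHotEncoder(sparse=False, drop=None)
encoded_array = encoder.fit_transform(df[categorical_cols])
# Reuse df's index so the concat below aligns row by row
encoded_df = pd.DataFrame(encoded_array, columns=encoder.get_feature_names_out(categorical_cols), index=df.index)
df_ohe = pd.concat([df.drop(columns=categorical_cols), encoded_df], axis=1)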

# Check the new dataframe's shape and look at the first few rows
print(df_ohe.shape)
df_ohe.head(3)

In [None]:
# Investigating correlation

corr_spearman = df_ohe.corr(method='spearman')
corr_spearman_saleprice = corr_spearman['SalePrice'].copy()
corr_spearman_sorted = corr_spearman_saleprice.reindex(corr_spearman_saleprice.abs().sort_values(ascending=False).index)
top_10_corr_spearman = corr_spearman_sorted[1:11]
top_10_corr_spearman

In [None]:

corr_pearson = df_ohe.corr(method='pearson')
corr_pearson_saleprice = corr_pearson['SalePrice'].copy()
corr_pearson_sorted = corr_pearson_saleprice.reindex(corr_pearson_saleprice.abs().sort_values(ascending=False).index)
top_10_corr_pearson = corr_pearson_sorted[1:11]
top_10_corr_pearson
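
The two ranking cells above repeat the same steps, so they could be wrapped in one helper. The sketch below is one possible way to do that, shown on hypothetical toy data. `top_correlations` is an illustrative name, and it drops the target by label rather than relying on it being the first entry after sorting:

```python
import pandas as pd

def top_correlations(frame, target, method="spearman", n=10):
    """Return the n features most strongly correlated with target, by absolute value."""
    corr = frame.corr(method=method)[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order).head(n)

# Hypothetical toy data, not taken from the house prices dataset
toy = pd.DataFrame({
    "SalePrice": [100, 150, 200, 250, 300],
    "area": [50, 70, 95, 120, 160],  # rises with price
    "age": [40, 30, 35, 10, 5],      # mostly falls with price
    "noise": [3, 1, 4, 1, 5],        # weak relationship
})

for method in ("spearman", "pearson"):
    print(method)
    print(top_correlations(toy, "SalePrice", method=method, n=2))
```

Ranking by absolute value keeps strongly negative relationships (such as `age` here) in the list while preserving their sign in the output.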

In [None]:
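# Here we will delve deeper into the correlations

top_n = 5

# This code is not mine. It is adapted from the Churnometer walkthrough project.
# The sorted series are used here because slicing the full correlation matrices
# (corr_pearson[:top_n]) would return their first rows, not the strongest correlations.

set(top_10_corr_pearson.iloc[:top_n].index.to_list() + top_10_corr_spearman.iloc[:top_n].index.to_list())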


In [None]:
features_to_analyze = ['1stFlrSF', '2ndFlrSF', 'BedroomAbvGr', 'BsmtFinSF1', 'BsmtUnfSF']
features_to_analyze 

In [None]:
df_eda = df.filter(features_to_analyze + ["SalePrice"]).copy()
df_eda.head(3)

In [None]:
from feature_engine.discretisation import EqualFrequencyDiscretiser
discretiser = EqualFrequencyDiscretiser(q=6, variables=["SalePrice"])
df_eda_transformed = discretiser.fit_transform(df_eda)
df_eda_transformed

In [None]:
# Show the bin edges (intervals) that the discretiser learned for SalePrice.

discretiser.binner_dict_['SalePrice']
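
To see what equal-frequency binning means in practice, the sketch below uses pandas' `qcut` as a stand-in for `EqualFrequencyDiscretiser`. The prices are hypothetical. Note also that `qcut` uses the observed minimum and maximum as outer edges, whereas the edges used in this notebook start at `-inf` and end at `inf`:

```python
import pandas as pd

# Hypothetical sale prices, chosen only to illustrate the binning
prices = pd.Series(
    [100, 120, 130, 150, 170, 200, 220, 260, 300, 400, 450, 700],
    name="SalePrice",
)

# Equal-frequency binning into 6 bins: each bin receives the same number of rows
bin_numbers, edges = pd.qcut(prices, q=6, labels=False, retbins=True)

print(bin_numbers.value_counts().sort_index())  # rows per bin number
print(edges)                                    # 7 edges delimit 6 bins
```

Equal-frequency bins keep every price band equally populated, which helps when the target is skewed towards lower prices.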

In [None]:
# Here we are making labels. 

labels = discretiser.binner_dict_["SalePrice"]
n_factor = len(labels) - 1 

labels_map = {
    n: (
        f"< {labels[1]}" if n == 0 else
        f"+{labels[n]}" if n < n_factor -1 else
        f"{labels[n]} to - {labels[n + 1]}"
    )
    for n in range(n_factor)
}

labels_map

In [None]:
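# Replace the bin numbers produced by the discretiser with their interval labels.
# The mapping is applied to df_eda_transformed because df_eda still holds raw prices,
# which never match the bin-number keys of labels_map.

df_eda_transformed["SalePrice"] = df_eda_transformed["SalePrice"].map(labels_map)
df_eda_transformed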
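
The sketch below illustrates why a label mapping only has an effect on a column that holds bin numbers. It uses hypothetical edges and a hypothetical helper, `make_labels`, whose labelling scheme differs from the one used in this notebook:

```python
import pandas as pd

# Hypothetical bin edges with infinite outer edges
edges = [float("-inf"), 100.0, 200.0, float("inf")]

def make_labels(edges):
    """Map each bin number to a readable interval label."""
    last = len(edges) - 2  # number of the final bin
    labels = {}
    for n in range(len(edges) - 1):
        if n == 0:
            labels[n] = f"< {edges[1]}"
        elif n == last:
            labels[n] = f"+{edges[n]}"
        else:
            labels[n] = f"{edges[n]} to {edges[n + 1]}"
    return labels

labels_map_demo = make_labels(edges)

binned = pd.Series([0, 2, 1, 1])              # bin numbers, as a discretiser returns them
raw = pd.Series([80.0, 350.0, 120.0, 199.0])  # raw prices

print(binned.map(labels_map_demo).tolist())   # every bin number finds a label
print(raw.map(labels_map_demo).isna().all())  # raw prices match no key
```

Mapping the raw prices produces only missing values, so a `fillna` with the original column would silently restore the prices and leave the column unlabelled.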

In [22]:
# Here we will use the Seaborn library for data visualization. Hue variable will be used to color data points by a categorical variable. 

hue_order = [label for label in labels_map.values()]

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Set seaborn style for plots
sns.set(style="whitegrid")

# Interval labels in ascending price order, taken from labels_map
# ('< 118500.0', '+118500.0', '+139700.0', '+163000.0', '+190000.0', '241416.66666666663 to - inf')
hue_order = list(labels_map.values())

# Function to plot a numerical column, coloured by the binned target variable
def plot_numerical_distribution(df, column, target_var, hue_order):
    if is_numeric_dtype(df[column]):
        fig, ax = plt.subplots(figsize=(12, 6))
        # seaborn builds the legend from the hue levels, so no manual ax.legend() call is needed
        sns.histplot(data=df, x=column, hue=target_var, hue_order=hue_order, kde=True, element="step", ax=ax)
        ax.set_title(f"{column} Distribution", fontsize=15)
        plt.show()
    else:
        print(f"Column '{column}' is not numerical and will not be plotted.")

# Define the target variable and columns to plot
target_var = "SalePrice"
numeric_columns = ["1stFlrSF", "2ndFlrSF", "BedroomAbvGr", "BsmtFinSF1", "BsmtUnfSF"]

# The hue column must hold interval labels rather than bin numbers,
# otherwise none of the hue_order entries match the data.
df_plot = df_eda_transformed.copy()
if is_numeric_dtype(df_plot[target_var]):
    df_plot[target_var] = df_plot[target_var].map(labels_map)

# Iterate over the selected columns and plot
for column in numeric_columns:
    plot_numerical_distribution(df_plot, column, target_var, hue_order=hue_order)

# Note for the assessors: I spent about 2-3 days trying to get my intervals into the legend and colour coded, consulting Slack, Stack Overflow and other resources. The cause was that the hue column held bin numbers while hue_order held interval labels, so the two never matched. Mapping the bin numbers to labels before plotting resolves this.

In [None]:
# Extract the two columns into a separate DataFrame
df_selected = df[["SalePrice", "BedroomAbvGr"]]

# Calculate the correlation matrix using Pearson method
correlation_matrix = df_selected.corr(method="pearson")
# Extract the specific correlation value
df_pearson = correlation_matrix.at["SalePrice", "BedroomAbvGr"]
df_pearson


In [None]:
x = df["BedroomAbvGr"]
y = df["SalePrice"]

plt.figure(figsize=(6,4))
sns.scatterplot(x=x, y=y)

plt.xlabel("BedroomAbvGr")
plt.ylabel("SalePrice")
plt.show()

- My first hypothesis was that houses with 4 bedrooms reach the highest prices. The scatterplot supports this: a number of 4-bedroom properties sell for $500,000 or more, and some even exceed $700,000.
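
A scatterplot shows the extremes well, but a grouped summary makes it easier to compare typical prices per bedroom count. The sketch below uses hypothetical rows (not values from HousePrices.csv) to show how this could be checked:

```python
import pandas as pd

# Hypothetical rows, not taken from HousePrices.csv
toy = pd.DataFrame({
    "BedroomAbvGr": [2, 2, 3, 3, 3, 4, 4, 5],
    "SalePrice": [120000, 135000, 160000, 180000, 210000, 250000, 520000, 230000],
})

# Number of houses, typical (median) price and highest price per bedroom count
summary = (
    toy.groupby("BedroomAbvGr")["SalePrice"]
    .agg(["count", "median", "max"])
    .sort_index()
)
print(summary)

# Bedroom count with the single highest price, and with the highest median price
print(summary["max"].idxmax(), summary["median"].idxmax())
```

Comparing `max` with `median` separates "a few very expensive houses" from "houses that are generally more expensive", which the scatterplot alone cannot distinguish.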

In [None]:
# Reusing the same code above for our next graph. 

# Extract the two columns into a separate DataFrame
df_selected = df[["SalePrice", "BsmtUnfSF"]]

# Calculate the correlation matrix using Pearson method
correlation_matrix = df_selected.corr(method="pearson")
# Extract the specific correlation value
df_pearson = correlation_matrix.at["SalePrice", "BsmtUnfSF"]
df_pearson


In [None]:
x = df["BsmtUnfSF"]
y = df["SalePrice"]

plt.figure(figsize=(6,4))
sns.scatterplot(x=x, y=y)

plt.xlabel("BsmtUnfSF")
plt.ylabel("SalePrice")
plt.show()

- My second hypothesis was that homes with smaller unfinished basements tend to have a wider range of sale prices. Although there is a cluster of properties in the 0-1000 square feet range, the properties are spread fairly evenly across a range of prices and square footage. There are also a couple of outliers around the $700,000 mark, which may suggest that other factors influence house prices. The correlation is weaker here, since a house with 2000+ unfinished sq. ft can sell for the same price as a house with 0-500 unfinished sq. ft.
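
The "wider range of prices" claim can be quantified by banding the basement area and measuring the price spread in each band. The sketch below uses hypothetical rows and band boundaries (both are assumptions for illustration):

```python
import pandas as pd

# Hypothetical rows, not taken from HousePrices.csv
toy = pd.DataFrame({
    "BsmtUnfSF": [0, 150, 400, 480, 900, 1200, 1600, 2100],
    "SalePrice": [110000, 340000, 95000, 210000, 180000, 260000, 230000, 250000],
})

# Hypothetical bands of unfinished basement area (square feet)
bands = pd.cut(toy["BsmtUnfSF"], bins=[0, 500, 1000, 2500], include_lowest=True)

# Lowest and highest sale price per band, and the gap between them
spread = toy.groupby(bands, observed=True)["SalePrice"].agg(["min", "max"])
spread["range"] = spread["max"] - spread["min"]
print(spread)
```

A larger `range` in the smallest band would support the hypothesis. Bands with very few houses should be read with caution, since a range based on one or two rows says little.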

In [None]:
# Reusing the same code above for our next graph. 

# Extract the two columns into a separate DataFrame
df_selected = df[["SalePrice", "1stFlrSF"]]

# Calculate the correlation matrix using Pearson method
correlation_matrix = df_selected.corr(method="pearson")
# Extract the specific correlation value
df_pearson = correlation_matrix.at["SalePrice", "1stFlrSF"]
df_pearson


In [None]:
x = df["1stFlrSF"]
y = df["SalePrice"]

plt.figure(figsize=(6,4))
sns.scatterplot(x=x, y=y)

plt.xlabel("1stFlrSF")
plt.ylabel("SalePrice")
plt.show()

- My third hypothesis was that houses with more square footage on the first floor generally cost more than houses with less. The scatterplot shows this is the case: as the first-floor square footage increases, the price also increases. The bar graph also shows this to be true.

# Final thoughts

From the above analysis and graphs, I've come to the conclusion that houses with more bedrooms, a larger area and more work completed tend to generate a higher sale price, as you would expect.

Diving deeper into this, houses with 4 bedrooms appear to reach the highest prices, which relates to Hypothesis 1: some 4-bedroom homes exceed $500,000 and a few surpass $700,000. Homes with smaller unfinished basements have a wider range of sale prices, which relates to Hypothesis 2. Although there is a cluster of properties with unfinished basements between 0-1000 sq. ft, the price range is broad. As mentioned in previous cells, outliers with higher prices suggest that other factors, such as location, may influence prices. The correlation between unfinished basement area and price is weak, indicating that this variable might not be a strong predictor of a house's sale price. Finally, houses with larger first-floor square footage tend to cost more, which relates to Hypothesis 3. The analysis shows a positive correlation between first-floor square footage and price, which is evident in both the scatterplots and bar graphs.

The above information and analysis show that larger homes with 4 bedrooms and more first-floor space tend to command higher prices.