# **House Prices Study Notebook**

## Objectives

* Answer business requirement 1:
    * The client wants to understand how different house attributes affect Sale Price in Ames, Iowa.

## Inputs

* Generate Dataset: outputs/datasets/collection/house_prices.csv

## Outputs

* Generate code that answers business requirement 1 and can be used to build the Streamlit App 


---

# Change working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

# Load Data

In [None]:
import pandas as pd
df = pd.read_csv("outputs/datasets/collection/house_prices.csv")
df.head(3)

---

# Data Exploration

* We are interested in getting more familiar with the dataset, checking variable type and distribution, missing levels and what these variables mean in a business context.

In [None]:
from ydata_profiling import ProfileReport
pandas_report = ProfileReport(df=df, minimal=True)
pandas_report.to_notebook_iframe()
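
As a lighter-weight complement to the profile report, the share of missing values and zeros per column can be tabulated directly with pandas. The sketch below is illustrative only: the helper name `missing_and_zero_share` and the toy values are assumptions, not part of the original notebook.

```python
import pandas as pd


def missing_and_zero_share(frame):
    """Return the share of missing and zero values for each column."""
    summary = pd.DataFrame({
        'missing_share': frame.isna().mean(),
        'zero_share': (frame == 0).mean(),
    })
    return summary.sort_values('missing_share', ascending=False)


# Toy data (made-up values) standing in for the house prices dataset
toy = pd.DataFrame({
    'LotFrontage': [65.0, None, None, 80.0],
    'WoodDeckSF': [0, 0, 120, 0],
    'SalePrice': [200000, 150000, 180000, 300000],
})
summary = missing_and_zero_share(toy)
print(summary)
```

On the real data, the same helper could be called as `missing_and_zero_share(df)` before any columns are dropped.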

* After reviewing the results of the profiling, the following columns were dropped for the Pearson and Spearman correlation studies due to high numbers of missing data and zeros.

In [None]:
df = df.drop(columns=['BsmtFinSF1', 'BsmtFinType1', 'BsmtUnfSF', 'GarageFinish', 'LotFrontage', 'MasVnrArea', 'EnclosedPorch', 'OpenPorchSF', 'WoodDeckSF'])

* Individual rows with missing data in the remaining columns are now dropped (from the Sheets analysis, these rows have missing values in 'BedroomAbvGr', 'TotalBsmtSF' and 'GarageYrBlt').

In [None]:
df = df.dropna()
print("All rows containing missing data have now been removed.")


* We started with 1,459 rows of data. After removing the columns with high levels of missing data or zeros, we dropped the remaining rows that contained missing data. We will use the remaining 1,283 rows for our correlation study (approx. 88% of the houses).

In [None]:
df.index
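
The retained share quoted above (1,283 / 1,459 ≈ 0.879) can be reproduced by comparing row counts before and after `dropna()`. A minimal sketch on a toy frame follows; the values are made up, and the variable names are illustrative rather than taken from the notebook.

```python
import pandas as pd

# Toy frame (made-up values) with one row containing a missing value
toy = pd.DataFrame({
    'GrLivArea': [1500, 1200, None, 2000],
    'SalePrice': [200000, 150000, 180000, 300000],
})

rows_before = len(toy)
toy_clean = toy.dropna()   # drop any row with at least one missing value
rows_after = len(toy_clean)

retained = rows_after / rows_before
print(f"{rows_after} of {rows_before} rows kept ({retained:.0%})")
```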

---

# Correlation Study

* Two of the columns in our dataset are of type object. We use a one-hot encoder to convert their categories into binary columns.

In [None]:
from feature_engine.encoding import OneHotEncoder
encoder = OneHotEncoder(variables=df.columns[df.dtypes=='object'].to_list(), drop_last=False)
df_ohe = encoder.fit_transform(df)
print(df_ohe.shape)
df_ohe.head(10)
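
To see what one-hot encoding does to an object column, here is a toy sketch. It uses `pandas.get_dummies` rather than the feature_engine encoder used above, so the output column names may differ slightly from `df_ohe`. The column name `quality_label` and its values are hypothetical.

```python
import pandas as pd

# Toy data: 'quality_label' is a hypothetical object column
toy = pd.DataFrame({
    'quality_label': ['Gd', 'TA', 'Ex', 'TA'],
    'SalePrice': [200000, 150000, 300000, 160000],
})

# One binary column per category level; numeric columns pass through unchanged
toy_ohe = pd.get_dummies(toy, columns=['quality_label'], dtype=int)
print(toy_ohe)
```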

* We use .corr() with the Spearman and Pearson methods, and investigate the top 10 correlations.

* This command returns a pandas Series. The first item is the correlation between SalePrice and itself, which is always 1, so we exclude it with [1:]

* We sort values considering the absolute value, by setting key=abs
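
The effect of `key=abs` and of dropping the first entry can be seen on a toy Series. The correlation values below are made up for illustration, and `.iloc[1:]` is used as an explicit positional equivalent of `[1:]`.

```python
import pandas as pd

# Made-up correlation values, including the target's self-correlation of 1.0
toy_corr = pd.Series({
    'SalePrice': 1.0,
    'feature_a': 0.2,
    'feature_b': -0.8,
    'feature_c': 0.5,
})

# Rank by magnitude so a strong negative correlation is not pushed to the bottom
ranked = toy_corr.sort_values(key=abs, ascending=False)

# Drop the first entry, which is the target's correlation with itself
top = ranked.iloc[1:]
print(top)
```

Without `key=abs`, `feature_b` would be ranked last despite having the largest magnitude of the three features.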

### Spearman Method

In [None]:
corr_spearman = df_ohe.corr(method='spearman')['SalePrice'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_spearman

### Pearson Method

In [None]:
corr_pearson = df_ohe.corr(method='pearson')['SalePrice'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_pearson

* Two variables show a strong correlation with Sale Price: Overall Quality (OverallQual rates the house's **material and finish**, as opposed to its overall **condition**, which is OverallCond) and above-ground living area in square feet (GrLivArea).

* The other variables show moderate correlation (between 0.3 and 0.7).

* We will take the top five correlations from each method, computed on df_ohe, and study the associated variables in df.
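
The bands used in these notes can be sketched as a small helper. The 0.3 and 0.7 thresholds come from the notes above; the "weak" label for values below 0.3 and the handling of values exactly on a boundary are assumptions, and the function name is hypothetical.

```python
def correlation_strength(coefficient):
    """Label a correlation coefficient by its absolute size."""
    size = abs(coefficient)
    if size >= 0.7:
        return 'strong'
    if size >= 0.3:
        return 'moderate'
    return 'weak'  # assumption: the notes do not name this band


# Made-up coefficients for illustration
for value in (0.79, -0.45, 0.1):
    print(value, correlation_strength(value))
```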

In [None]:
top_n = 5
combined_list = corr_pearson[:top_n].index.to_list() + corr_spearman[:top_n].index.to_list()
significant_variables = set(combined_list)
print(significant_variables)


We will therefore study the following variables in df.

We will investigate whether a house's Sale Price is affected by:

* The above-ground living area in square feet ('GrLivArea')
* The overall material quality and finish of the house ('OverallQual')
* The year the garage was built ('GarageYrBlt')
* The area of the house's first floor in square feet ('1stFlrSF')
* The year the house was built ('YearBuilt')
* The total area of the house's basement in square feet ('TotalBsmtSF')
* The size of the house's garage in square feet ('GarageArea')



In [None]:
vars_to_study = ['GrLivArea', 'OverallQual', 'GarageYrBlt', '1stFlrSF', 'YearBuilt', 'TotalBsmtSF', 'GarageArea']
vars_to_study

---

# EDA on Selected Variables

In [None]:
df_eda = df.filter(vars_to_study + ['SalePrice'])
df_eda.head(3)

### Linear Regression 

We plot each selected variable individually against our target variable, Sale Price.

#### Bivariate Analysis

First we conduct bivariate analysis. This shows the correlation between each individual variable and our target variable visually, and lets us see outliers in our data.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(context='notebook', style='darkgrid', palette='husl')


def plot_numerical(df, col, target_var):
    plt.figure(figsize=(8, 5))
    sns.regplot(data=df, x=col, y=target_var) 
    plt.title(f"{col}", fontsize=20, y=1.05)
    plt.show()


target_var = 'SalePrice'
for col in vars_to_study:
    plot_numerical(df_eda, col, target_var)
    print("\n")

#### Notes on the Bivariate Analysis

* There are significant outliers present in our dataset.
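
One way to put a number on the outliers seen in the plots is the interquartile range (IQR) rule. This technique is not used in the notebook and is shown only as an assumption-labelled sketch; the helper name, the conventional multiplier of 1.5 and the toy values are all illustrative.

```python
import pandas as pd


def iqr_outlier_mask(series, k=1.5):
    """Return a boolean mask marking values outside k * IQR of the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Made-up living areas with one obviously extreme value
toy = pd.Series([1000, 1100, 1200, 1300, 1400, 5000], name='GrLivArea')
outliers = iqr_outlier_mask(toy)
print(toy[outliers])
```

On the real data, the same helper could be applied per column, for example `iqr_outlier_mask(df_eda['GrLivArea']).sum()`, to count flagged rows.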

#### Multivariate Analysis

In [None]:
sns.lmplot(data=df, x="GarageYrBlt", y="SalePrice", ci=None, hue="OverallQual")
plt.show()

In [None]:
import numpy as np

# Correlation matrix of the selected variables and the target (pandas default: Pearson)
df_corr = df_eda.corr()

# Mask the upper triangle (including the diagonal) so each pair is shown once
mask = np.triu(np.ones_like(df_corr, dtype=bool))

plt.figure(figsize=(8, 6))
sns.heatmap(data=df_corr, mask=mask, annot=True, fmt='.2g', linewidths=0.5)
plt.show()


---

# Push files to Repo

* If you do not need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
    # create your folder here
    # os.makedirs(name='')
    pass  # placeholder so the try block stays valid until a folder is created
except Exception as e:
    print(e)
