# **House Sales Price Study**

## Objectives

* Answer business requirement 1:
  * The client wants to understand which house variables are most strongly correlated with the sale price.

## Inputs

* outputs/datasets/collection/house_prices_after_inspection.csv

## Outputs

* Generate code that answers business requirement 1 and can be used to build the Streamlit App

## Additional Comments

* Data derives from Kaggle but has been provided by CI 


---

# Change working directory to the parent folder

Access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspaces/PP5-heritage-housing-issues/jupyter_notebooks'

Make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspaces/PP5-heritage-housing-issues'

# Load the Data

In [4]:
import pandas as pd
df = pd.read_csv("outputs/datasets/collection/house_prices_after_inspection.csv")
#df.head()
df.tail()

Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,EnclosedPorch,GarageArea,GarageFinish,...,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd,SalePrice
1455,953,694.0,3.0,No,0,Unf,953,,460,RFn,...,62.0,0.0,40,5,6,953,0.0,1999,2000,175000
1456,2073,0.0,,No,790,ALQ,589,,500,Unf,...,85.0,119.0,0,6,6,1542,,1978,1988,210000
1457,1188,1152.0,4.0,No,275,GLQ,877,,252,RFn,...,66.0,0.0,60,9,7,1152,,1941,2006,266500
1458,1078,0.0,2.0,Mn,49,,0,112.0,240,Unf,...,68.0,0.0,0,6,5,1078,,1950,1996,142125
1459,1256,0.0,3.0,No,830,BLQ,136,0.0,276,Fin,...,75.0,0.0,68,6,5,1256,736.0,1965,1965,147500


# Create a profile report for quick Exploratory Data Analysis (EDA)

In [5]:
from ydata_profiling import ProfileReport
profile_report = ProfileReport(df=df, minimal=True)
#profile_report
#profile_report.to_notebook_iframe()

## EDA Observations

* This dataset is predominantly numerical.
* Only 4 variables are categorical: BsmtExposure, BsmtFinType1, GarageFinish, KitchenQual.
* The 4 categorical variables are imbalanced.
* Several variables have missing values and zeros.
* Most numerical variables do not appear to be normally distributed.
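
The observations above can also be checked numerically rather than read off the profile report. The following is a minimal sketch on a small made-up DataFrame; the toy values and the helper name `summarise_eda` are illustrative assumptions, not part of the project. On the real data one would pass `df` instead.

```python
import numpy as np
import pandas as pd

def summarise_eda(data: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, missing count, zero count and skewness (numeric columns only)."""
    numeric = data.select_dtypes(include="number")
    summary = pd.DataFrame({
        "dtype": data.dtypes.astype(str),
        "n_missing": data.isna().sum(),
        # Zeros are only counted for numeric columns; other columns get 0
        "n_zeros": numeric.eq(0).sum().reindex(data.columns, fill_value=0),
    })
    # Skewness far from 0 hints at a non-normal distribution
    summary["skew"] = numeric.skew()
    return summary

# Toy data for illustration only (hypothetical values)
toy = pd.DataFrame({
    "2ndFlrSF": [694.0, 0.0, 1152.0, 0.0, np.nan],
    "GarageFinish": ["RFn", "Unf", "RFn", None, "Fin"],
    "SalePrice": [175000, 210000, 266500, 142125, 147500],
})
eda_summary = summarise_eda(toy)
print(eda_summary)
```

Skewness is only a rough indicator; a histogram or Q-Q plot per variable would give a fuller picture.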

# Handle Missing Values (NaN)

In [None]:
# for col in df.select_dtypes(include='object').columns:
#     if df[col].isnull().any():
#         df[col] = df[col].fillna('Missing')

# Correlation Study: Pearson and Spearman

**Goal:** identify how the target (SalePrice) correlates with the other variables, and retrieve the top 5 variables most correlated with SalePrice.

* Step 1: Since Spearman and Pearson need numeric variables, transform the categorical variables to numerical variables using one-hot encoding.

In [10]:
from feature_engine.encoding import OneHotEncoder
categorical_vars = df.select_dtypes(include='object').columns.to_list()
# Label missing categories first; the encoder is expected to reject NaN
df_filled = df.fillna({col: 'Missing' for col in categorical_vars})
# feature_engine's OneHotEncoder does not accept handle_unknown, so it is dropped
one_hot_encoder = OneHotEncoder(variables=categorical_vars, drop_last=False)
df_ohe = one_hot_encoder.fit_transform(df_filled)
df_ohe.tail()

In [9]:
# Correlate on the one-hot encoded frame: df still holds string columns, which corr() cannot convert to float
corr_pearson = df_ohe.corr(method='pearson')['SalePrice'].sort_values(key=abs, ascending=False)[1:].head(10)
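
One way the goal of this section could be reached is sketched below: encode the categorical columns, compute both coefficients against the target, and keep the strongest by absolute value. The sketch swaps in `pandas.get_dummies` for the feature_engine encoder to stay dependency-light, runs on made-up toy data, and uses a hypothetical helper named `top_correlations`.

```python
import pandas as pd

def top_correlations(data, target, method, n=5):
    """Return the n features most correlated with target, ranked by absolute value."""
    # One-hot encode the non-numeric columns so every column is numeric
    encoded = pd.get_dummies(data, dtype=float)
    corr = encoded.corr(method=method)[target].drop(target)
    return corr.sort_values(key=abs, ascending=False).head(n)

# Toy data for illustration only (hypothetical values)
toy = pd.DataFrame({
    "1stFlrSF": [900, 1200, 1500, 1800, 2100, 2400],
    "OverallQual": [4, 5, 6, 6, 8, 9],
    "GarageFinish": ["Unf", "Unf", "RFn", "RFn", "Fin", "Fin"],
    "SalePrice": [110000, 140000, 175000, 200000, 260000, 320000],
})

top_pearson = top_correlations(toy, "SalePrice", method="pearson", n=3)
top_spearman = top_correlations(toy, "SalePrice", method="spearman", n=3)
print(top_pearson)
print(top_spearman)
```

Dropping the target by name, rather than slicing off the first row, keeps the result correct even if another column happens to tie with the target's self-correlation of 1.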

---

# Load and Inspect Kaggle data

### Read CSV files

In [None]:
import pandas as pd
df_house_prices = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv")
df_house_prices.head()
# print(df_house_prices.shape)


In [None]:
df_inherited_houses = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/inherited_houses.csv")
df_inherited_houses.head()
# print(df_inherited_houses.shape)

### Read TXT files

In [None]:
df_house_metadata = pd.read_csv("inputs/datasets/raw/house-metadata.txt", header=None)
df_house_metadata.head()
# print(df_house_metadata.shape)

### DataFrames Summary

In [None]:
df_house_prices.info()

In [None]:
df_inherited_houses.info()

In [None]:
df_house_metadata.info()

### Check for duplicates 
* There are no duplicates in the data. There is also no unique identifier such as "HouseID" to drop.

In [None]:
df_house_prices.duplicated().sum()
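
As a small illustration of what `duplicated()` reports, the sketch below builds a made-up frame with one repeated row. The toy values are assumptions for demonstration only and do not come from the project data.

```python
import pandas as pd

# Toy data for illustration only: the third row repeats the first
toy = pd.DataFrame({
    "1stFlrSF": [953, 2073, 953],
    "YearBuilt": [1999, 1978, 1999],
    "SalePrice": [175000, 210000, 175000],
})

# duplicated() flags every row that repeats an earlier row across all columns
n_duplicates = int(toy.duplicated().sum())
deduplicated = toy.drop_duplicates().reset_index(drop=True)
print(n_duplicates, len(deduplicated))
```

A result of 0 on the real data, as noted above, means no row is an exact copy of another.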

### Confirm Target data type
* The target is already a numeric variable.

In [None]:
df_house_prices['SalePrice'].dtype

### Notes
* The variables GarageYrBlt, YearBuilt and YearRemodAdd are numeric. 
* While they could be converted to datetime data type, their current numerical format facilitates their use in Pearson and Spearman correlation analyses and as direct inputs for the regression model.
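
The trade-off in these notes can be sketched as follows, using made-up values: integer years feed directly into a correlation, whereas a datetime column has to be turned back into a number before it can be used the same way.

```python
import pandas as pd

# Toy data for illustration only (hypothetical pairing of years and prices)
toy = pd.DataFrame({
    "YearBuilt": [1941, 1950, 1965, 1978, 1999],
    "SalePrice": [266500, 142125, 147500, 210000, 175000],
})

# Integer years can go straight into a correlation
year_corr = toy["YearBuilt"].corr(toy["SalePrice"], method="pearson")

# Converting to datetime is possible, but the column then has to be
# turned back into a number (here the year) before modelling
as_datetime = pd.to_datetime(toy["YearBuilt"].astype(str), format="%Y")
back_to_year = as_datetime.dt.year
print(year_corr, as_datetime.dtype)
```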

---

# Push files to Repo

### Create outputs directory

In [None]:
import os
# exist_ok=True makes the call safe to re-run when the folder already exists
os.makedirs(name='outputs/datasets/collection', exist_ok=True)


### Save the data as CSV

In [None]:
df_house_prices.to_csv("outputs/datasets/collection/house_prices_after_inspection.csv", index=False)
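
The `index=False` argument matters when the file is read back later. The sketch below writes a made-up frame to a temporary folder both ways and compares the columns after reloading; the file names and toy values are assumptions for illustration.

```python
import os
import tempfile
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({"1stFlrSF": [953, 2073], "SalePrice": [175000, 210000]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = os.path.join(tmp_dir, "outputs", "datasets", "collection")
    os.makedirs(out_dir, exist_ok=True)

    with_index = os.path.join(out_dir, "with_index.csv")
    without_index = os.path.join(out_dir, "without_index.csv")
    toy.to_csv(with_index)                  # index written as an unnamed first column
    toy.to_csv(without_index, index=False)  # only the data columns are written

    cols_with_index = pd.read_csv(with_index).columns.to_list()
    cols_without_index = pd.read_csv(without_index).columns.to_list()

print(cols_with_index)     # ['Unnamed: 0', '1stFlrSF', 'SalePrice']
print(cols_without_index)  # ['1stFlrSF', 'SalePrice']
```

Without `index=False`, the reloaded frame gains an extra `Unnamed: 0` column that would otherwise have to be dropped before analysis.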