# **Sales Price Study Notebook**

## Objectives

* Answer the first business requirement:
    * Enable the customer to visualize which features of the dataset are most closely correlated with the property price.
        
## Inputs
        
* outputs/datasets/collection/HousePricesData.csv

        
## Outputs
        
* Generate code that answers business requirement 1 and can be used to build the Streamlit App.

## Additional comments

* We will apply the methodology described in https://github.com/Code-Institute-Solutions/churnometer/blob/main/jupyter_notebooks/02%20-%20Churned%20Customer%20Study.ipynb, adapting it to our dataset and business requirements.
        

---

## Change working directory

* We use os.getcwd() to access the current directory.

In [2]:
import os
current_dir = os.getcwd()
current_dir

'/workspaces/P5-heritage-housing-issues/jupyter_notebooks'

## Access the parent directory
* We want to make the parent of the current directory the new current directory.
    * os.path.dirname() gets the parent directory
    * os.chdir() sets the new current directory

In [3]:
os.chdir(os.path.dirname(current_dir))
print("A new current directory has been set")

A new current directory has been set


## Confirm the new current directory

In [4]:
current_dir = os.getcwd()
current_dir

'/workspaces/P5-heritage-housing-issues'
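The directory change above can be tried in isolation with the following minimal sketch. It assumes a throwaway folder layout created in a temporary directory (the names `project_root` and `notebooks_dir` are hypothetical) and restores the original working directory at the end.

```python
import os
import tempfile

# Remember where we started so the sketch leaves no side effects.
original_dir = os.getcwd()

# Hypothetical layout: <project_root>/jupyter_notebooks, built in a temp folder.
with tempfile.TemporaryDirectory() as project_root:
    notebooks_dir = os.path.join(project_root, "jupyter_notebooks")
    os.mkdir(notebooks_dir)

    # Start inside the notebooks folder, as the notebook does.
    os.chdir(notebooks_dir)

    # os.path.dirname() returns the parent; os.chdir() makes it current.
    os.chdir(os.path.dirname(os.getcwd()))

    # Compare resolved paths, since temp folders can be symlinked.
    moved_to_parent = (
        os.path.realpath(os.getcwd()) == os.path.realpath(project_root)
    )

    # Leave the temp folder before it is deleted.
    os.chdir(original_dir)

print(moved_to_parent)
```

Running the change a second time would move one level further up, so in a notebook it is worth confirming the result with os.getcwd() before loading any files.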

---

## Load data

* Import the pandas library
* Load the dataset as a pandas DataFrame and assign it to our dataframe df_prices
* View the data in the df_prices variable

In [None]:
import pandas as pd
df_prices = pd.read_csv("outputs/datasets/collection/HousePricesData.csv")
df_prices.head(5)

## Data Exploration

* We want to become more familiar with the dataset: check the variables and determine their type, distribution and missing levels. We will use this exploration to build a picture of what these variables mean in our project's context.
* We will use the pandas Profile Report from ydata-profiling, formerly pandas-profiling (https://github.com/ydataai/ydata-profiling/blob/develop/README.md), to get an analysis of each variable, including its missing data levels and range. The cell below imports it under the earlier package name, pandas_profiling.
    * We notice there are 20 numerical and 4 categorical variables.

In [None]:
from pandas_profiling import ProfileReport
pandas_report = ProfileReport(df=df_prices, minimal=True)
pandas_report.to_notebook_iframe()
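When the full profile report is not needed, a quick overview of types and missing levels can be sketched with plain pandas. This is not a replacement for the report, only a lightweight check; the toy DataFrame and the column names `LivingArea` and `Quality` are hypothetical stand-ins for the real dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for df_prices.
toy = pd.DataFrame({
    "SalePrice": [200000, 150000, 320000, 180000],
    "LivingArea": [1500, 1100, None, 1300],
    "Quality": ["Gd", "TA", "Ex", None],
})

# One row per variable: its type, missing count and missing percentage.
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "missing": toy.isna().sum(),
    "missing_pct": (toy.isna().mean() * 100).round(1),
})

# Count numerical variables and treat everything else as categorical.
n_numerical = toy.select_dtypes(include="number").shape[1]
n_categorical = toy.select_dtypes(exclude="number").shape[1]

print(summary)
print(n_numerical, "numerical,", n_categorical, "categorical")
```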

## Correlation Study

* We will use Pearson's and Spearman's correlation coefficients. As both only consider numerical variables, we will use OneHotEncoder to transform the categorical data. The code below uses the feature_engine implementation; the documentation for the scikit-learn encoder of the same name can be found here https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

In [None]:
from feature_engine.encoding import OneHotEncoder
encoder = OneHotEncoder(variables=df_prices.columns[df_prices.dtypes=='object'].to_list(), drop_last=False)
df_ohe = encoder.fit_transform(df_prices)
print(df_ohe.shape)
df_ohe.head(3)
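To see what one-hot encoding does to a categorical column, here is a minimal sketch on toy data. It swaps in pandas.get_dummies for the feature_engine encoder used above, so it illustrates the idea rather than the notebook's exact method; the `Quality` column and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with one numerical and one categorical column.
toy = pd.DataFrame({
    "SalePrice": [200000, 150000, 320000],
    "Quality": ["Gd", "TA", "Ex"],
})

# Select every non-numerical column for encoding.
categorical_cols = toy.select_dtypes(exclude="number").columns.to_list()

# Each category becomes its own 0/1 column named <column>_<category>.
toy_ohe = pd.get_dummies(toy, columns=categorical_cols, dtype=int)

print(toy_ohe.shape)
print(toy_ohe.head(3))
```

Every category keeps its own column here, which is comparable to setting drop_last=False in the cell above.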

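* In this dataset, the target variable is SalePrice, so we select the SalePrice column of the correlation matrix and sort its values by their absolute value, using .sort_values(key=abs). The [1:] slice drops the first entry, which is the correlation of SalePrice with itself, and .head(10) keeps the ten strongest correlations.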

In [None]:
corr_spearman = df_ohe.corr(method='spearman')['SalePrice'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_spearman
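The same ranking can be sketched for Pearson's coefficient and compared with Spearman's on toy data. The feature names and values below are hypothetical, chosen so that one feature is monotonically but not linearly related to the target. Unlike the cell above, this sketch removes the target by label with .drop() instead of [1:], an assumption made here so the result does not depend on how ties with the target's self-correlation are ordered.

```python
import pandas as pd

# Hypothetical toy data: feature_a is monotonic but non-linear in SalePrice.
toy = pd.DataFrame({
    "SalePrice": [1, 8, 27, 64, 125],
    "feature_a": [1, 2, 3, 4, 5],
    "feature_b": [2, 1, 4, 3, 5],
    "feature_c": [5, 4, 3, 1, 2],
})

def rank_correlations(df, target, method):
    """Correlate every column with the target, strongest first."""
    corr = df.corr(method=method)[target].drop(target)
    return corr.sort_values(key=abs, ascending=False)

corr_spearman = rank_correlations(toy, "SalePrice", "spearman")
corr_pearson = rank_correlations(toy, "SalePrice", "pearson")

print(corr_spearman)
print(corr_pearson)
```

Spearman works on ranks, so it rates the monotonic feature_a as a perfect relationship, while Pearson, which measures linear association, gives it a lower value. Sorting by absolute value keeps strong negative correlations such as feature_c near the top.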