# **Notebook 2: Housing Prices Correlation**

## Objectives
* The client is interested in discovering how house attributes correlate with the sale price.
* Create data visualizations of the correlated variables against the sale price.

## Inputs
* outputs/datasets/collection/HousePricesRecords.csv

## Outputs
* An overview of all features and their data types, so that we understand the dataset better before proceeding to data cleaning.

## Change working directory
In this section we get the location of the current directory and move one level up to its parent folder, so that the code in this notebook can access the project folder.

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("you have set a new current directory")

Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

## Loading Dataset

In [4]:
import pandas as pd

df = pd.read_csv('outputs/datasets/collection/HousePricesRecords.csv')
df.head()

## Exploring the Given Data

In our data exploration phase, we will utilize a **Profile Report** to comprehensively analyze the dataset provided. This report will help us understand various aspects of the data, which are crucial for further analysis and model building. Here’s what the Profile Report will include:

### Feature Breakdown
- **Overview of Each Feature**: The report will provide a detailed examination of each feature (column) in the dataset, including data types, unique values, and missing values. This helps in identifying features that require cleaning or transformation.
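
The per-feature overview described above can also be approximated directly in pandas. The following is a minimal sketch, not the report's own implementation: the toy DataFrame, its column names (`GrLivArea`, `KitchenQual`, `SalePrice`), and the helper `feature_overview` are illustrative assumptions rather than values taken from the project dataset.

```python
import pandas as pd

# Toy stand-in for the housing data; column names and values are assumptions.
toy = pd.DataFrame({
    "GrLivArea": [1710, 1262, 1786, 1717],
    "KitchenQual": ["Gd", "TA", "Gd", None],
    "SalePrice": [208500, 181500, 223500, 140000],
})


def feature_overview(frame):
    """Return one row per column: dtype, distinct non-missing values, missing count."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_unique": frame.nunique(),      # missing values are not counted as a category
        "n_missing": frame.isna().sum(),
    })


overview = feature_overview(toy)
print(overview)
```

In the notebook, the same hypothetical helper could be called as `feature_overview(df)` to get a compact table alongside the profile report.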

### Distribution of Features
- **Statistical Summary and Distributions**: For each numeric feature, the report will include statistics such as mean, median, range, and standard deviation, along with histograms to visualize the distribution. Categorical features will be summarized with frequency counts and bar charts, illustrating how observations are distributed among different categories.
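- **Correlations**: A full profile report also explores correlations between features, identifying potential relationships that could inform feature selection and engineering. With `minimal=True`, as used in the cell below, expensive computations such as correlations are switched off, so correlations are examined separately after data cleaning.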
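
As a rough illustration of the statistics listed above, the sketch below computes a numeric summary, category frequencies, and Pearson correlations against the sale price with plain pandas. The toy data and the column names (`GrLivArea`, `YearBuilt`, `KitchenQual`, `SalePrice`) are assumptions made for this example only.

```python
import pandas as pd

# Toy stand-in for the housing data; column names and values are assumptions.
toy = pd.DataFrame({
    "GrLivArea": [1000, 1500, 2000, 2500, 3000],
    "YearBuilt": [1950, 1990, 1970, 2005, 2010],
    "KitchenQual": ["TA", "Gd", "TA", "Gd", "Ex"],
    "SalePrice": [100000, 150000, 200000, 250000, 300000],
})

# Mean, standard deviation, min/max and quartiles for every numeric column.
numeric = toy.select_dtypes(include="number")
summary = numeric.describe()

# Frequency counts for a categorical column.
category_counts = toy["KitchenQual"].value_counts()

# Pearson correlation of each numeric feature with the target, strongest first.
corr_with_price = (
    numeric.corr(method="pearson")["SalePrice"]
    .drop("SalePrice")
    .sort_values(ascending=False)
)

print(summary)
print(category_counts)
print(corr_with_price)
```

Pearson correlation only captures linear relationships; `method="spearman"` is a common alternative when a monotonic but non-linear relationship is suspected.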

### Missing Data Analysis
- **Missing Values**: Identification and quantification of missing data within each feature. The report will highlight patterns of missing data, which are critical for deciding how to handle them—whether to impute, discard, or use techniques like modeling to estimate missing values.
- **Impact of Missing Data**: Analysis of how missing values could affect analyses and potential biases they might introduce in the model.
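
Missing data can also be quantified by hand, which is useful when deciding which columns to impute or drop. The sketch below is one possible approach under stated assumptions: the toy DataFrame, its column names (`LotFrontage`, `GarageFinish`, `SalePrice`), and the helper `missing_report` are hypothetical and are not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the housing data; column names and values are assumptions.
toy = pd.DataFrame({
    "LotFrontage": [65.0, None, 68.0, None],
    "GarageFinish": ["RFn", None, "Unf", "Fin"],
    "SalePrice": [208500, 181500, 223500, 140000],
})


def missing_report(frame):
    """List columns that contain missing values, with counts and percentages."""
    counts = frame.isna().sum()
    report = pd.DataFrame({
        "n_missing": counts,
        "pct_missing": (counts / len(frame) * 100).round(2),
    })
    # Keep only affected columns, most incomplete first.
    return report[report["n_missing"] > 0].sort_values("pct_missing", ascending=False)


report = missing_report(toy)
print(report)
```

Applied to the real data as `missing_report(df)`, a table like this gives a concrete basis for the cleaning decisions made in the next notebook.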

### Benefits of Profile Reporting
- **Efficiency**: Quickly gain insights into the dataset without manually plotting each feature.
- **Decision Support**: Empower data-driven decisions on preprocessing steps, such as feature scaling, encoding of categorical variables, and handling missing values.
- **Model Readiness**: Ensure the data is well-understood and appropriately prepped before moving into predictive modeling.

By generating a Profile Report, we aim to lay a solid foundation for all subsequent data handling and analytical tasks. This structured approach not only saves time but also highlights critical insights that drive the analytical strategy forward.


In [5]:
from ydata_profiling import ProfileReport

pandas_report = ProfileReport(df, minimal=True)
pandas_report.to_notebook_iframe()

## Conclusions

A large share of the data is missing, and some of the recorded values may also contain errors.

We will move on to the next notebook for data cleaning, and then check correlations.