# **Notebook 4: Price Analysis**

## Objectives

* **Explore Relationships Between Features and Sale Price**
  * Conduct in-depth analysis of the relationships between house attributes and the target variable, `SalePrice`.
  * Use visualizations and statistical methods to identify trends and patterns that affect property values.
* **Validate Business Hypotheses**
  * Test key hypotheses about the drivers of house prices, including the influence of quality, size and location attributes.
* **Generate Insights for Client Needs**
  * Provide actionable insights that help the client understand the factors influencing the value of their inherited properties and similar houses in Ames, Iowa.
  * Present findings in a clear and interpretable manner for client use.
* **Prepare Visualizations for Dashboard**
  * Develop interactive and static visualizations that effectively communicate findings and align with the dashboard requirements.

## Inputs

* **Processed Datasets**
  * `x_train_transformed.csv`: Feature-engineered and scaled training dataset for modeling and analysis.
  * `x_test_transformed.csv`: Feature-engineered and scaled testing dataset for validation.
  * `y_train.csv`: Training dataset target variable (SalePrice).
  * `y_test.csv`: Testing dataset target variable (SalePrice).
* **Supplementary Data**
  * Domain knowledge and project-specific hypotheses for guiding analysis.
* **Stored Locations**
  * Datasets are located in the `outputs/datasets/processed/transformed/` and `outputs/datasets/processed/split/` directories.

## Outputs

* **Insights and Findings**
  * Detailed analysis of the relationships between key features and sale price.
  * Validation of hypotheses with supporting evidence.
* **Visualizations**
  * Scatter plots, box plots, heatmaps, and other graphical representations to highlight trends and patterns.
  * Summary visualizations prepared for dashboard integration.
* **Documentation**
  * Summary of analysis, key takeaways, and recommendations for downstream modeling and dashboard integration.

## Additional Comments

* **Context**
  * This notebook focuses on data exploration and analysis, bridging the gap between feature engineering and model building. It provides the foundation for deriving insights and recommendations.
* **Alignment with CRISP-DM**
  * This notebook aligns with the Data Understanding and Business Understanding steps, ensuring that exploratory findings are actionable and relevant to the client's needs.
* **Next Steps**
  * The outputs from this notebook will inform the Model Training and Evaluation notebook, where predictive models will be developed and optimized.


---

## Change working directory

* We assume the notebooks are stored in a subfolder, so when running a notebook in the editor you need to change the working directory.

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with `os.getcwd()`

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/Predictive-Analytics-PP5/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* `os.path.dirname()` gets the parent directory
* `os.chdir()` defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/Predictive-Analytics-PP5'
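
The same step can also be written with `pathlib`. The following is only a sketch of an alternative, not what the cells above run; the helper name `move_to_parent` is hypothetical.

```python
import os
from pathlib import Path


def move_to_parent(start=None):
    """Make the parent of `start` (default: the current directory) the working directory.

    Hypothetical helper; it mirrors os.chdir(os.path.dirname(os.getcwd())).
    """
    start = Path(start) if start is not None else Path.cwd()
    parent = start.resolve().parent
    os.chdir(parent)  # same effect as the os.chdir() call used above
    return parent
```

Calling `move_to_parent()` once from the notebooks subfolder would leave the project root as the working directory. Running it a second time moves up another level, so it should be run only once per session.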

---

## Load and Prepare Data

**Overview**
In this section, we will:
1. Load the processed datasets required for price analysis.
2. Confirm the structure and contents of the datasets to ensure readiness for analysis.
3. Check for any discrepancies, such as missing values or incorrect data types, that may impact analysis.

In [4]:
import pandas as pd

# Define file paths for datasets
x_train_path = "outputs/datasets/processed/transformed/x_train_transformed.csv"
x_test_path = "outputs/datasets/processed/transformed/x_test_transformed.csv"
y_train_path = "outputs/datasets/processed/split/y_train.csv"
y_test_path = "outputs/datasets/processed/split/y_test.csv"

# Load Datasets
x_train = pd.read_csv(x_train_path)
x_test = pd.read_csv(x_test_path)
y_train = pd.read_csv(y_train_path)
y_test = pd.read_csv(y_test_path)

# Display basic information about the datasets
print("Training Features Dataset:")
print(x_train.info())
print("\nTesting Features Dataset:")
print(x_test.info())
print("\nTraining Target Dataset:")
print(y_train.info())
print("\nTesting Target Dataset:")
print(y_test.info())

# Preview the first few rows of each dataset
print("\nPreview of Training Features Dataset:")
display(x_train.head())

print("\nPreview of Testing Features Dataset:")
display(x_test.head())

print("\nPreview of Training Target Dataset:")
display(y_train.head())

print("\nPreview of Testing Target Dataset:")
display(y_test.head())

# Check for missing values
print("\nChecking for missing values in training features:")
print(x_train.isnull().sum())

print("\nChecking for missing values in testing features:")
print(x_test.isnull().sum())

print("\nChecking for missing values in training target:")
print(y_train.isnull().sum())

print("\nChecking for missing values in testing target:")
print(y_test.isnull().sum())

Training Features Dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1168 entries, 0 to 1167
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       1168 non-null   float64
 1   1       1168 non-null   float64
 2   2       1168 non-null   float64
 3   3       1168 non-null   float64
 4   4       1168 non-null   float64
 5   5       1168 non-null   float64
 6   6       1168 non-null   float64
dtypes: float64(7)
memory usage: 64.0 KB
None

Testing Features Dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 292 entries, 0 to 291
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       292 non-null    float64
 1   1       292 non-null    float64
 2   2       292 non-null    float64
 3   3       292 non-null    float64
 4   4       292 non-null    float64
 5   5       292 non-null    float64
 6   6       292 non-null    float64
dtypes: float64(7)
memor

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | 0.455469 | -0.116096 | 0.887733 | -0.437833 | -0.292584 | -0.161873 | 0.0 |
| 1 | -0.718609 | 0.455054 | -1.415946 | 0.85819 | 0.250597 | -0.304082 | 1.0 |
| 2 | 1.988293 | -1.409123 | -1.415946 | 0.102176 | -1.816242 | -0.071879 | 0.0 |
| 3 | 1.107734 | 0.918129 | 0.640194 | 0.102176 | 0.609851 | -0.477855 | 0.0 |
| 4 | 1.531707 | 1.593562 | 0.340697 | -0.437833 | 0.474436 | -1.22528 | 0.0 |



Preview of Testing Features Dataset:


| | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | 0.259789 | -0.621623 | 0.841719 | 1.506202 | -0.922794 | -0.15846 | 0.0 |
| 1 | -0.751222 | 0.792606 | 0.720328 | 0.642186 | 1.808434 | 0.61254 | 1.0 |
| 2 | 1.433867 | -0.830326 | -1.415946 | -0.437833 | -1.038836 | -0.029579 | 0.0 |
| 3 | 0.781602 | 1.552164 | 0.541291 | 0.85819 | 0.425488 | -1.22528 | 1.0 |
| 4 | -1.175195 | -0.420838 | 0.856249 | 1.182196 | 0.343995 | 0.717202 | 0.0 |



Preview of Training Target Dataset:


| | SalePrice |
|---|---|
| 0 | 145000 |
| 1 | 178000 |
| 2 | 85000 |
| 3 | 175000 |
| 4 | 127000 |



Preview of Testing Target Dataset:


| | SalePrice |
|---|---|
| 0 | 154500 |
| 1 | 325000 |
| 2 | 115000 |
| 3 | 159000 |
| 4 | 315500 |



Checking for missing values in training features:
0    0
1    0
2    0
3    0
4    0
5    0
6    0
dtype: int64

Checking for missing values in testing features:
0    0
1    0
2    0
3    0
4    0
5    0
6    0
dtype: int64

Checking for missing values in training target:
SalePrice    0
dtype: int64

Checking for missing values in testing target:
SalePrice    0
dtype: int64


**Expected Outputs:**

1. **Dataset Structure:**
   - Training Features: 1168 rows and 7 columns, all numeric and scaled (`float64`).
   - Testing Features: 292 rows and 7 columns, all numeric and scaled (`float64`).
   - Training Target: 1168 rows and 1 column (`int64`).
   - Testing Target: 292 rows and 1 column (`int64`).
2. **Preview of Data:**
   - The first few rows of each dataset confirm the structure, scaling and content.
   - Features are scaled, and target variables are numeric.
3. **Missing Values:**
   - No missing values in any dataset, as confirmed by the `0` counts across all columns.
4. **Readiness for Analysis:**
   - Datasets are complete, structured, and preprocessed for further exploration.
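
The expectations listed above can also be checked programmatically instead of by reading the printed output. The sketch below is an assumption about how such a check might look; `check_dataset` is a hypothetical helper, not part of the notebook, and it reports problems rather than raising errors.

```python
import pandas as pd


def check_dataset(features, target, expected_rows, expected_cols):
    """Return a list of problems found; an empty list means every check passed.

    Hypothetical helper: verifies shape, float64 dtypes, and absence of missing values.
    """
    problems = []
    if features.shape != (expected_rows, expected_cols):
        problems.append(
            f"features shape is {features.shape}, expected {(expected_rows, expected_cols)}"
        )
    if len(target) != expected_rows:
        problems.append(f"target has {len(target)} rows, expected {expected_rows}")
    # Scaled features are expected to be float64 throughout
    non_float = features.select_dtypes(exclude=["float64"]).columns.tolist()
    if non_float:
        problems.append(f"non-float64 feature columns: {non_float}")
    if features.isnull().to_numpy().any():
        problems.append("missing values in features")
    if target.isnull().to_numpy().any():
        problems.append("missing values in target")
    return problems
```

With the datasets loaded above, `check_dataset(x_train, y_train, 1168, 7)` and `check_dataset(x_test, y_test, 292, 7)` would each be expected to return an empty list.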

---

## Exploratory Data Analysis (EDA)

### Sale Price Distribution

### Correlation Analysis
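
As a possible starting point for this step, the sketch below ranks the features by the strength of their Pearson correlation with `SalePrice`. It is an illustration under assumptions, not the notebook's final analysis: `correlation_with_target` is a hypothetical helper, and it expects the target as a Series (for example `y_train["SalePrice"]`) sharing the features' index.

```python
import pandas as pd


def correlation_with_target(features, target):
    """Pearson correlation of each feature column with the target,
    ordered from strongest to weakest absolute correlation.

    Hypothetical helper; `target` must be a Series aligned with `features`.
    """
    corr = features.corrwith(target)  # Pearson by default
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)
```

Because the transformed feature columns are labelled `0` to `6`, the ranking would need to be mapped back to the original feature names before it is presented to the client.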

### Pairwise Analysis

### Multivariate Analysis

### Feature Comparison Across Quartiles
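
One way this comparison might be set up is to bin `SalePrice` into four equal-sized groups with `pd.qcut` and average each feature within a group. This is a sketch under assumptions: the helper name `feature_means_by_price_quartile` and the labels `Q1` to `Q4` are hypothetical, and the target is expected as a Series such as `y_train["SalePrice"]`.

```python
import pandas as pd


def feature_means_by_price_quartile(features, target):
    """Mean of every feature within each sale-price quartile.

    Hypothetical helper; Q1 holds the cheapest quarter of houses, Q4 the most expensive.
    """
    quartile = pd.qcut(target, q=4, labels=["Q1", "Q2", "Q3", "Q4"])
    return features.groupby(quartile, observed=True).mean()
```

Since the features are scaled, the resulting means are in standardized units rather than original units, so they indicate direction and relative size of differences between quartiles rather than raw magnitudes.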

### Outlier Analysis

---

## Business Insights

### Key Drivers of Sale Price

### Client-Specific Observations

---

## Save Outputs

---

## Conclusion & Next Steps

### Conclusion

### Next Steps