# **Data Exploration Study**

## Objectives

### Business Requirement 1:
The client is interested in discovering how the house attributes correlate with the sale price. Therefore, the client expects data visualisations of the correlated variables against the sale price.

#### Covered in this Notebook:
1) Features and correlations related to: Missing Values
2) Features and correlations related to: Feature Types
3) Distribution of Continuous Numerical Features
4) Variable significance in a business context


## Inputs

outputs/datasets/collection/house_prices_records.csv

## Outputs

Code that answers Business Requirement 1 and can be reused to build the Streamlit App.

## Additional Notes

The Target Variable is "SalePrice".

# Change working directory

In [None]:
import os
current_dir = os.getcwd()
current_dir

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

In [None]:
current_dir = os.getcwd()
current_dir

# EDA: Import Tools

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# Display all columns of the DataFrame
pd.set_option('display.max_columns', None)

# EDA: Load Data

In [None]:
dataset_raw_path = "outputs/datasets/collection/house_prices_records.csv"
dataset = pd.read_csv(dataset_raw_path)
print(dataset.shape)

In [None]:
dataset.head()

# EDA: Start

# Explore Features
## Of Type: All
### Target: Missing Values

In [None]:
# Collect every feature that has at least one missing value
features_with_missing_values = [feature for feature in dataset.columns if dataset[feature].isnull().sum() > 0]

for feature in features_with_missing_values:
    # isnull().mean() returns a fraction, so scale it to a percentage
    print(feature, np.round(dataset[feature].isnull().mean() * 100, 2), '% missing values in the entire dataset')

## Explore Correlations
### Between: Missing Values & the Target Variable

In [None]:
for feature in features_with_missing_values:
    data_mval = dataset.copy()
    
    # Flag each observation: 1 where the feature is missing,
    # 0 where a value is present
    data_mval[feature] = np.where(data_mval[feature].isnull(), 1, 0)
    
    data_mval.groupby(feature)['SalePrice'].median().plot.bar()
    plt.title(feature)
    plt.show()

#### The missing values will be handled in the Feature Engineering section of the project.
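
The bar charts above compare medians only. A group with very few observations can produce a misleading median, so it may help to look at the group sizes as well. The following is a minimal sketch on invented toy data, not on the project dataset; the values and the helper column name `LotFrontage_is_missing` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing data; all values are invented for illustration.
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, np.nan, 70.0, 60.0],
    "SalePrice": [200000, 150000, 220000, 160000, 210000, 190000],
})

# 1 where the feature is missing, 0 where a value is present.
# The Series name is a hypothetical label used only for the summary index.
is_missing = toy["LotFrontage"].isnull().astype(int).rename("LotFrontage_is_missing")

# Report the group size next to the median so small groups are easy to spot.
summary = toy.groupby(is_missing)["SalePrice"].agg(["count", "median"])
print(summary)
```

The same grouping could be run for each entry in `features_with_missing_values` if the counts turn out to be useful.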

# Explore Features
## Of Type: Numerical

In [None]:
# List
numerical_features = [feature for feature in dataset.columns if dataset[feature].dtypes != 'O']

print('Number of Numerical Features: ', len(numerical_features))

# Preview the first rows
dataset[numerical_features].head()

# Explore Features
## Of Type: Temporal

In [None]:
# List
year_feature_in_numerical_features = [feature for feature in numerical_features if 'Yr' in feature or 'Year' in feature]
year_feature_in_numerical_features

In [None]:
for feature in year_feature_in_numerical_features:
    print(feature, dataset[feature].unique())

## Explore Correlations
### Between: Temporal Features & the Target Variable

In [None]:
# Median sale price per construction year
dataset.groupby('YearBuilt')['SalePrice'].median().plot()
plt.xlabel('Year Built')
plt.ylabel('Median House Price')
plt.title("YearBuilt vs SalePrice")
plt.show()

### Visualize Correlations
Note: The x-axis shows the number of years between each year feature and GarageYrBlt.

In [None]:
for feature in year_feature_in_numerical_features:
    if feature!='GarageYrBlt':
        # Work on a copy so the original dataset stays unchanged
        data_of_year_features=dataset.copy()
        # Years elapsed between this feature and the year the garage was built
        data_of_year_features[feature]=data_of_year_features['GarageYrBlt']-data_of_year_features[feature]

        plt.scatter(data_of_year_features[feature],data_of_year_features['SalePrice'])
        plt.xlabel(feature)
        plt.ylabel('SalePrice')
        plt.show()

# Explore Features
## Of Type: Numerical - Discrete

In [None]:
# Treat numerical, non-temporal features with fewer than 25 unique values as discrete
discrete_in_numerical_features=[feature for feature in numerical_features if len(dataset[feature].unique())<25 and feature not in year_feature_in_numerical_features]

print("Number of Discrete Features: {}".format(len(discrete_in_numerical_features)))

In [None]:
discrete_in_numerical_features

In [None]:
dataset[discrete_in_numerical_features].head()

## Explore Correlations
### Between: Discrete Features & the Target Variable

In [None]:
for feature in discrete_in_numerical_features:
    data_discrete=dataset.copy()
    data_discrete.groupby(feature)['SalePrice'].median().plot.bar()
    plt.xlabel(feature)
    plt.ylabel('SalePrice')
    plt.title(feature)
    plt.show()

### Conclusions & Observations
- OverallQual shows a monotonic relationship with the target variable (SalePrice).
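
A monotonic relationship can also be quantified with a rank correlation. The sketch below computes Spearman's coefficient as the Pearson correlation of the ranks. It uses invented toy values, not the project dataset; only the column names `OverallQual` and `SalePrice` are taken from the notebook.

```python
import pandas as pd

# Toy stand-in for the housing data; all values are invented for illustration.
toy = pd.DataFrame({
    "OverallQual": [3, 4, 5, 5, 6, 7, 8, 9],
    "SalePrice": [80000, 95000, 120000, 125000, 150000, 200000, 280000, 400000],
})

# Spearman correlation is the Pearson correlation of the ranks, so it
# measures how monotonic (rather than how linear) a relationship is.
ranks = toy.rank()
spearman = ranks["OverallQual"].corr(ranks["SalePrice"])

# Plain Pearson correlation on the raw values, for comparison.
pearson = toy["OverallQual"].corr(toy["SalePrice"])

print("Spearman:", round(spearman, 3))
print("Pearson:", round(pearson, 3))
```

In this toy data the price grows faster than linearly with quality, so the rank-based coefficient comes out higher than the linear one.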

## Explore Correlations 
### Between: Continuous Features & the Target Variable

In [None]:
# Continuous features: numerical features that are neither discrete nor temporal
cont_feature=[feature for feature in numerical_features if feature not in discrete_in_numerical_features+year_feature_in_numerical_features]
print("Number of Continuous Features: {}".format(len(cont_feature)))

In [None]:
for feature in cont_feature:
    data_cont=dataset.copy()
    data_cont[feature].hist(bins=25)
    plt.xlabel(feature)
    plt.ylabel("Count")
    plt.title(feature)
    plt.show()

### Conclusions & Observations
- The histograms show non-Gaussian (non-normal) distributions, which points to skewness.
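
Skewness can be read off a histogram, but it can also be measured. As a rough sketch, `DataFrame.skew()` returns one skewness value per column, which makes it possible to rank features. The data below is synthetic and the column names are hypothetical; on the project data the call would be applied to `dataset[cont_feature]`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic columns with hypothetical names, for illustration only.
toy = pd.DataFrame({
    "roughly_normal": rng.normal(loc=100.0, scale=15.0, size=5000),
    "right_skewed": rng.lognormal(mean=0.0, sigma=1.0, size=5000),
})

# Values near 0 suggest a symmetric distribution;
# large positive values suggest a long right tail.
skewness = toy.skew().sort_values(ascending=False)
print(skewness)
```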

# Explore Distribution
## Applied Technique: Logarithmic Transformation

(The actual technique is applied in the Data Cleaning notebook.)

- Applied to feature type: Numerical - Continuous
- Purpose: To reduce skewness and bring the distribution closer to normal
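
To illustrate the effect, the sketch below compares the skewness of a synthetic, strictly positive sample before and after taking the natural logarithm. The sample is generated for illustration and does not come from the project dataset. The last lines show `np.log1p`, one possible way to handle columns that contain zeros; this is an assumption about a workable alternative, not the method used in this project, where the box-plot cell further down skips such features.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic, strictly positive, right-skewed values standing in for a continuous feature.
raw = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=5000))

# The natural log is only defined for strictly positive values.
logged = np.log(raw)

print("skew before:", round(raw.skew(), 3))
print("skew after:", round(logged.skew(), 3))

# np.log1p computes log(1 + x), so a zero maps to 0 instead of -inf.
with_zeros = pd.Series([0.0, 10.0, 100.0])
print(np.log1p(with_zeros).tolist())
```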

## Explore Data Points
### Of Type: Outliers
#### With: Box Plot

In [None]:
# Identify outliers on the log scale
for feature in cont_feature:
    data_of_outliers = dataset.copy()
    # Skip features that contain zeros, because log(0) evaluates to -inf
    if 0 in data_of_outliers[feature].unique():
        continue
    data_of_outliers[feature] = np.log(data_of_outliers[feature])
    data_of_outliers.boxplot(column=feature)
    plt.ylabel(feature)
    plt.title(feature)
    plt.show()
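
By default, a box plot draws points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles as individual outliers. The same rule can be applied numerically to count them. This is a minimal sketch; the helper name `count_iqr_outliers` and the toy values are hypothetical and not part of the project code.

```python
import pandas as pd


def count_iqr_outliers(values: pd.Series, k: float = 1.5) -> int:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return int(((values < lower) | (values > upper)).sum())


# Toy values invented for illustration; 40 lies far above the rest.
toy = pd.Series([10, 11, 12, 12, 13, 13, 14, 15, 40])
n_outliers = count_iqr_outliers(toy)
print(n_outliers)
```

To mirror the box plots above, the helper could be called on `np.log(dataset[feature])` for each feature that contains no zeros.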

# Explore Features
## Of Type: Categorical

In [None]:
data_of_categorical=dataset.copy()

categorical_features=[feature for feature in dataset.columns if data_of_categorical[feature].dtypes=='O']
categorical_features

In [None]:
dataset[categorical_features].head()

In [None]:
for feature in categorical_features:
    print(' {} has {}'.format(feature,dataset[feature].unique()))

In [None]:
# Count the labels/categories of each feature
for feature in categorical_features:
    print(' {} has {} labels'.format(feature,len(dataset[feature].unique())))

## Explore Correlations
### Between: Categorical Features & the Target Variable

In [None]:
for feature in categorical_features:
    data_of_categorical=dataset.copy()
    data_of_categorical.groupby(feature)['SalePrice'].median().plot.bar()
    plt.xlabel(feature)
    plt.ylabel('SalePrice')
    plt.title(feature)
    plt.show()

# EDA: Finish

In [None]:
# Create the output folder if it does not exist yet
os.makedirs('outputs/datasets/collection', exist_ok=True)

# Each DataFrame saved below holds the state left by the last loop iteration of its cell.
# File-1) Data Points - Missing Values
data_mval.to_csv("outputs/datasets/collection/data_mval.csv", index=False)

# File-2) Temporal Features
data_of_year_features.to_csv("outputs/datasets/collection/data_of_year_features.csv", index=False)

# File-3) Discrete Numerical Features
data_discrete.to_csv("outputs/datasets/collection/data_discrete.csv", index=False)

# File-4) Cont. Numerical Features
data_cont.to_csv("outputs/datasets/collection/data_cont.csv", index=False)

# File-5) Data Points - Outliers
data_of_outliers.to_csv("outputs/datasets/collection/data_of_outliers.csv", index=False)

# File-6) Categorical Features
data_of_categorical.to_csv("outputs/datasets/collection/data_of_categorical.csv", index=False)