# Notebook 04 - Garages & Years for Build Dates - Cleaning and fixing all Features

## Objectives
* Clean data
* Evaluate and process missing data
* Fix potential issues with data in given features:
    * GarageArea - Size of garage in square feet
    * GarageFinish - Interior Finish of the garage
    * GarageYrBlt - Year garage was built

## Inputs
* inputs/datasets/cleaning/basement.parquet.gzip

## Outputs
* Clean and fix (missing and potentially wrong) data in the given columns
* After cleaning is completed, we will save current dataset in inputs/datasets/cleaning/garages_and_build_years.parquet.gzip

## Change working directory
In this section we get the location of the current directory and move one level up, to the parent folder, so the app can access the project folder.

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [None]:
import os

current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("you have set a new current directory")

Confirm new current directory

In [None]:
current_dir = os.getcwd()
current_dir

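We check the current working directory once more

In [None]:
current_dir

If this still shows **jupyter_notebooks** (the subfolder that holds this notebook), we go one step up to the parent directory, which is the main project directory. If the directory was already changed above, nothing else needs to happen.
Print out to confirm the working directory

In [None]:
# Only move up if we are still inside the notebooks subfolder, so we never move up twice
if os.path.basename(current_dir) == "jupyter_notebooks":
    os.chdir(os.path.dirname(current_dir))
current_dir = os.getcwd()
current_dir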

## Loading Dataset

In [None]:
import pandas as pd

df = pd.read_parquet("inputs/datasets/cleaning/basement.parquet.gzip")
df.head()

## Exploring Data

We will work with the features that are missing data. First we get the data types of the given features

In [None]:
columns_of_interest = ['GarageArea', 'GarageFinish', 'GarageYrBlt']
column_types = df[columns_of_interest].dtypes

# Display the data types of these columns
column_types

We will fix missing values:
1. GarageArea and GarageYrBlt missing values will be replaced with 0
2. GarageFinish missing values will be replaced with the string 'None'

Also, we will convert GarageYrBlt from float to int

In [None]:
# Fill missing values for 'GarageArea' and 'GarageYrBlt' with 0
df[['GarageArea', 'GarageYrBlt']] = df[['GarageArea', 'GarageYrBlt']].fillna(0)

# Fill missing values for 'GarageFinish' with the string 'None'
df['GarageFinish'] = df['GarageFinish'].fillna('None')

# Convert 'GarageYrBlt' to integer
df['GarageYrBlt'] = df['GarageYrBlt'].astype(int)

df[columns_of_interest].head()
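
As a small illustration of why the integer conversion comes after the fill, here is a sketch on a made-up three-row frame (the `toy` values are hypothetical, not taken from the dataset). A column that holds NaN is stored as float, and it can only be cast to int once the gaps are filled.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the real dataset
toy = pd.DataFrame({
    "GarageArea": [548.0, np.nan, 460.0],
    "GarageFinish": ["RFn", None, "Unf"],
    "GarageYrBlt": [2003.0, np.nan, 1976.0],
})

# float64, because NaN cannot be stored in a plain integer column
dtype_before = toy["GarageYrBlt"].dtype

# Same three steps as in the cell above
toy[["GarageArea", "GarageYrBlt"]] = toy[["GarageArea", "GarageYrBlt"]].fillna(0)
toy["GarageFinish"] = toy["GarageFinish"].fillna("None")
toy["GarageYrBlt"] = toy["GarageYrBlt"].astype(int)

dtype_after = toy["GarageYrBlt"].dtype
print(dtype_before, "->", dtype_after)
```

Calling `astype(int)` before `fillna(0)` would raise an error, because NaN has no integer representation.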

Now we will check that, where there is no garage, all values are 0 or None accordingly

We will reuse the function from the previous cleaning notebook - 03_basement.ipynb
We also need to change the feature values to be inspected

In [None]:
def check_consistency(df, primary_feature):
    """
    Checks consistency of a primary feature against a set of expected values for related features.
    
    Parameters:
        df (pd.DataFrame): The DataFrame containing the data.
        primary_feature (str): The primary feature to be checked.
    
    Returns:
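        pd.DataFrame: The inconsistent records (None if primary_feature is not defined).
        Also adds a boolean 'Consistency' column to df as a side effect.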
    """
    # Directly define features and their values indicating 'no presence' in a dictionary
    features_and_values = {
        "GarageArea": 0,
        "GarageFinish": 'None',
        "GarageYrBlt": 0
    }

    # Ensure primary feature is valid
    if primary_feature not in features_and_values:
        print(f"Feature {primary_feature} not defined in feature settings.")
        return

    # Determine the primary value to check against
    primary_value = features_and_values[primary_feature]

    # Check each feature against the primary feature's condition
    df['Consistency'] = df.apply(
        lambda row: True if row[primary_feature] != primary_value else all(
            row[feature] == value for feature, value in features_and_values.items() if feature != primary_feature
        ), axis=1
    )

    # Filter and return inconsistent records
    inconsistent_records = df[df['Consistency'] == False]
    return inconsistent_records
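
The same check can also be written without a row-wise `apply` and without adding a helper column to `df`. The sketch below is an alternative under stated assumptions, not the function used in this notebook: `find_inconsistent` and `NO_GARAGE` are hypothetical names, and the toy rows are made up.

```python
import pandas as pd

# Values that mean "no garage" for each feature (same mapping as above)
NO_GARAGE = {"GarageArea": 0, "GarageFinish": "None", "GarageYrBlt": 0}


def find_inconsistent(df, primary_feature, no_presence=NO_GARAGE):
    """Return rows where primary_feature says 'no garage' but another feature disagrees."""
    primary_absent = df[primary_feature] == no_presence[primary_feature]
    others = [f for f in no_presence if f != primary_feature]
    # True where every other feature also holds its 'no garage' value
    others_absent = pd.concat(
        [df[f] == no_presence[f] for f in others], axis=1
    ).all(axis=1)
    return df[primary_absent & ~others_absent]


# Hypothetical toy data
toy = pd.DataFrame({
    "GarageArea": [0, 400, 0],
    "GarageFinish": ["None", "None", "Unf"],
    "GarageYrBlt": [0, 1990, 0],
})
bad = find_inconsistent(toy, "GarageFinish")
print(bad)
```

In the toy frame only the second row is flagged for `GarageFinish`: its finish is 'None' although the area and year show that a garage exists.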

We loop through each feature and print the results

Again we will reuse code from previous notebook

In [None]:
def loop_check_consistency(df, features):
    for feature in features:
        errors = check_consistency(df, feature)
        error_count = errors.shape[0]  # Get the number of rows in the errors DataFrame
        print(f"Feature {feature} has {error_count} inconsistent rows.")


# Run the loop check consistency function
loop_check_consistency(df, columns_of_interest)

We can see that all features are consistent except GarageFinish

Let's check that feature separately

In [None]:
garage_finish_check = check_consistency(df, 'GarageFinish')
garage_finish_check[columns_of_interest]

We can see that there are a lot of None values where a garage exists. Let's check whether there is any correlation between GarageFinish and the other columns.
To achieve that, we will make a copy of the dataframe, encode all object columns as integers, and then check the correlations

In [None]:
from sklearn.preprocessing import LabelEncoder
import pandas as pd

# Create a copy of the DataFrame
df_encoded = df.copy()

# Initialize a LabelEncoder
encoder = LabelEncoder()

# Apply LabelEncoder to each categorical column
for column in df_encoded.select_dtypes(include=['object']).columns:
    df_encoded[column] = encoder.fit_transform(df_encoded[column])

# Check the transformed DataFrame
print(df_encoded.head())


In [None]:
# Calculate Pearson correlation for 'GarageFinish' assuming it's still relevant
# If 'GarageFinish' is no longer a column, replace it with an appropriate column name
if 'GarageFinish' in df_encoded.columns:
    corr_pearson = df_encoded.corr(method='pearson')['GarageFinish'].sort_values(ascending=False, key=abs)[1:].head(10)
    print(corr_pearson)
else:
    print("'GarageFinish' is not in the DataFrame. Please replace it with a relevant column.")


In [None]:
# Calculate Spearman correlation for 'GarageFinish' assuming it's still relevant
# If 'GarageFinish' is no longer a column, replace it with an appropriate column name
if 'GarageFinish' in df_encoded.columns:
    corr_spearman = df_encoded.corr(method='spearman')['GarageFinish'].sort_values(ascending=False, key=abs)[1:].head(
        10)
    print(corr_spearman)
else:
    print("'GarageFinish' is not in the DataFrame. Please replace it with a relevant column.")


In [None]:
# Calculate Kendall correlation for 'GarageFinish' assuming it's still relevant
# If 'GarageFinish' is no longer a column, replace it with an appropriate column name
if 'GarageFinish' in df_encoded.columns:
    corr_kendall = df_encoded.corr(method='kendall')['GarageFinish'].sort_values(ascending=False, key=abs)[1:].head(10)
    print(corr_kendall)
else:
    print("'GarageFinish' is not in the DataFrame. Please replace it with a relevant column.")
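
The encode-then-correlate steps above can be tried on a small made-up frame. This is a minimal sketch, assuming `pd.factorize(sort=True)` as a pandas-only stand-in for `LabelEncoder`; the toy values are hypothetical. It also drops the target column by name instead of slicing with `[1:]`, so the result does not depend on the target sorting into first position.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "GarageFinish": ["Unf", "RFn", "Fin", "Unf", "None", "Fin"],
    "GarageArea": [300, 450, 600, 280, 0, 620],
    "KitchenQual": ["TA", "Gd", "Ex", "TA", "TA", "Gd"],
})

encoded = toy.copy()
for column in encoded.select_dtypes(exclude="number").columns:
    # factorize(sort=True) assigns integer codes in sorted label order
    codes, _ = pd.factorize(encoded[column], sort=True)
    encoded[column] = codes

# Correlation of every other column with GarageFinish, strongest first
corr = (
    encoded.corr(method="spearman")["GarageFinish"]
    .drop("GarageFinish")
    .sort_values(key=abs, ascending=False)
)
print(corr)
```

Integer codes for nominal categories are arbitrary, so a weak correlation on encoded labels does not rule out a relationship between the underlying categories.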


We cannot see any correlation between GarageFinish and the other features.

Now we will apply the most common GarageFinish value in the whole dataset to these records

In [None]:
mode_value = df['GarageFinish'].mode()[0]  # mode() returns a Series; [0] accesses the first mode
mode_value

In [None]:
df.loc[garage_finish_check.index, 'GarageFinish'] = 'Unf'
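
A minimal sketch of the same replacement that uses the computed mode instead of a hard-coded label. It assumes made-up rows, and it takes the mode only over rows that have a real finish, so that 'None' itself can never be picked as the replacement.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "GarageArea": [0, 400, 300, 520, 250],
    "GarageFinish": ["None", "None", "Unf", "RFn", "Unf"],
    "GarageYrBlt": [0, 1990, 1975, 2001, 1968],
})

# Rows that claim "no finish" although a garage clearly exists
inconsistent = (toy["GarageFinish"] == "None") & (
    (toy["GarageArea"] != 0) | (toy["GarageYrBlt"] != 0)
)

# Most common finish among rows that actually have one
mode_value = toy.loc[toy["GarageFinish"] != "None", "GarageFinish"].mode()[0]
toy.loc[inconsistent, "GarageFinish"] = mode_value
print(toy)
```

The first row keeps 'None' because its area and year are both 0, which is the consistent "no garage" case.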

Now let's check whether there are any more values that do not match

In [None]:
garage_finish_check = check_consistency(df, 'GarageFinish')
garage_finish_check[columns_of_interest]

## GarageYrBlt fixing

We want to check when the garages were built. Usually a garage is not built before the house it belongs to, so:
1. We will filter buildings that have a garage
2. We will filter buildings where the garage build date is earlier than the house build date

In [None]:
garage_year_mistake = df[
    (df['GarageYrBlt'] < df['YearBuilt']) & ((df['GarageFinish'] != 'None') | (df['GarageArea'] != 0))]
garage_year_mistake[['GarageYrBlt', 'YearBuilt', 'GarageFinish', 'GarageArea']]

We can see that there are 9 records where the garage was built (GarageYrBlt) before the house was built.
It is possible that the garage was built during remodeling. Let's check whether the garage build year lines up with the house build and renovation dates (YearBuilt and YearRemodAdd respectively)

In [None]:
garage_year_mistake[['GarageYrBlt', 'YearBuilt', 'GarageFinish', 'GarageArea', 'YearRemodAdd']]

Let's check when most of the garages were built: in the year of building or in the year of renovation

In [None]:
garages_built_same_as_house_year = df[(df['GarageYrBlt'] == df['YearBuilt']) & (df['GarageArea'] > 0)]
garages_built_same_as_renovation_year = df[(df['GarageYrBlt'] == df['YearRemodAdd']) & (df['GarageArea'] > 0)]

# Printing output
print("Garages built same as building: ", garages_built_same_as_house_year.shape[0])
print("Garages added during renovation: ", garages_built_same_as_renovation_year.shape[0])

We can see that most of them were built in the same year as the building. But if we look more closely, we can see that we have:
1088 + 725 = 1813, which is more than all records in the dataset. The two groups must overlap, which is possible when the renovation year is the same as the year the house was built.
Let's check how many records have a garage year equal to the renovation year, this time without the GarageArea filter

In [None]:
df[(df['GarageYrBlt'] == df['YearRemodAdd'])].shape[0]

We have 725 records where the garage year matches the renovation year. Note that this compares GarageYrBlt with YearRemodAdd, so it counts garages, not houses that were renovated in the year they were built.
Let's check whether any garages are dated after the renovation year

In [None]:
df[(df['GarageYrBlt'] > df['YearRemodAdd'])].shape[0]

We can see that there are 127 records where the garage year is later than the renovation year.
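
To make the overlap between the two garage counts explicit, and to compare the house build year with the renovation year directly, the sketch below uses four made-up rows (the values are hypothetical). It is an illustration only, not a result from the dataset.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "YearBuilt":    [2000, 1960, 1995, 1980],
    "YearRemodAdd": [2000, 1990, 1995, 1985],
    "GarageYrBlt":  [2000, 1990, 1995, 1980],
})

same_as_house = toy["GarageYrBlt"] == toy["YearBuilt"]
same_as_remod = toy["GarageYrBlt"] == toy["YearRemodAdd"]
# Direct check: was the house renovated in the year it was built?
built_equals_remod = toy["YearBuilt"] == toy["YearRemodAdd"]

n_house = int(same_as_house.sum())
n_remod = int(same_as_remod.sum())
n_both = int((same_as_house & same_as_remod).sum())
n_either = int((same_as_house | same_as_remod).sum())
n_built_equals_remod = int(built_equals_remod.sum())

# Rows counted in both groups explain why the two totals can exceed the row count
print(n_house, n_remod, n_both, n_either, n_built_equals_remod)
```

Adding the two group sizes counts the overlapping rows twice; subtracting the overlap gives the number of distinct rows.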

We need to check whether there are any NaN or zero values in the house build and renovation years

In [None]:
print(df['YearBuilt'].isna().sum())  # Counts how many NaN values are in the 'YearBuilt' column
print((df['YearBuilt'] == 0).sum())  # Counts how many values in 'YearBuilt' are equal to 0

print(df['YearRemodAdd'].isna().sum())  # Counts how many NaN values are in the 'YearRemodAdd' column
print((df['YearRemodAdd'] == 0).sum())  # Counts how many values in 'YearRemodAdd' are equal to 0


We can see that there are no missing values and no values equal to 0, and we know that:
* All buildings were built between 1872 and 2010
* At the same time, the renovation dates are also within the same limits

Next steps:
1. If a renovation date is between 1872 and 2010 but is earlier than the build date, the data was entered in the wrong cells and the two values need swapping
2. Select all buildings that have a garage and whose build date is NOT the same as the renovation date, then:
    * we count how many garages were built in the same year as the building
    * we count how many garages were built during renovation
    * whichever count is higher is the more common case, and we will apply that date to the garage build date for garages with wrong dates
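
Step 1 could be sketched as follows. This is a minimal illustration on made-up rows, assuming that any row with a renovation year earlier than the build year really is a swapped entry; the `swapped` mask is a hypothetical name.

```python
import pandas as pd

# Hypothetical toy data: the second row has the two years in the wrong cells
toy = pd.DataFrame({
    "YearBuilt":    [1990, 2005, 1950],
    "YearRemodAdd": [1995, 1998, 1950],
})

swapped = toy["YearRemodAdd"] < toy["YearBuilt"]

# Assign from .to_numpy() so pandas does not realign the columns by name
toy.loc[swapped, ["YearBuilt", "YearRemodAdd"]] = (
    toy.loc[swapped, ["YearRemodAdd", "YearBuilt"]].to_numpy()
)
print(toy)
```

Without `.to_numpy()` the right-hand side would be matched back to the columns by label, and the values would end up where they started.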

In [None]:
df_tmp = df[(df['YearBuilt'] != df['YearRemodAdd'])]
df_tmp.shape[0]

In [None]:
# Only houses whose build year differs from the renovation year can tell the two cases apart
different_years = df['YearBuilt'] != df['YearRemodAdd']
garages_built_same_as_house_year = df[
    (df['GarageYrBlt'] == df['YearBuilt']) & (df['GarageArea'] > 0) & different_years]
garages_built_same_as_renovation_year = df[
    (df['GarageYrBlt'] == df['YearRemodAdd']) & (df['GarageArea'] > 0) & different_years]

# Printing output
print("Garages built same as building: ", garages_built_same_as_house_year.shape[0])
print("Garages added during renovation: ", garages_built_same_as_renovation_year.shape[0])

Both counts drop by the same number of rows compared with the earlier totals (the garages where all three years coincide), so garages built in the same year as the house remain the more common case.

Based on that, if the garage year is earlier than the house build year, we will change the garage build year to the house build year

In [None]:
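# Correct garage years earlier than the house build year, only where a garage exists (others keep 0)
has_garage = (df['GarageFinish'] != 'None') | (df['GarageArea'] != 0)
too_early = has_garage & (df['GarageYrBlt'] < df['YearBuilt'])
df.loc[too_early, 'GarageYrBlt'] = df['YearBuilt']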
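
The sketch below shows, on made-up rows, how restricting the correction to houses that have a garage keeps the 0 placeholder intact for houses without one. A plain `GarageYrBlt < YearBuilt` comparison is also true for every row where the garage year is 0. The toy values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: the second row has no garage at all
toy = pd.DataFrame({
    "YearBuilt":    [2000, 1985, 1970],
    "GarageYrBlt":  [1998, 0, 1975],
    "GarageArea":   [420, 0, 300],
    "GarageFinish": ["Unf", "None", "RFn"],
})

has_garage = (toy["GarageFinish"] != "None") | (toy["GarageArea"] != 0)
too_early = has_garage & (toy["GarageYrBlt"] < toy["YearBuilt"])

# Only the first row is corrected; the no-garage row keeps its 0
toy.loc[too_early, "GarageYrBlt"] = toy.loc[too_early, "YearBuilt"]
print(toy)
```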

Checking again whether any garage date mistakes remain.

In [None]:
garage_year_mistake = df[
    (df['GarageYrBlt'] < df['YearBuilt']) & ((df['GarageFinish'] != 'None') | (df['GarageArea'] != 0))]
garage_year_mistake[['GarageYrBlt', 'YearBuilt', 'GarageFinish', 'GarageArea']]

All garage information is now cleaned and fixed; the house build and renovation dates were checked for missing values.

## Removing added columns

We will use the same code as in the previous cleaning notebook, 03_basement.ipynb

In [None]:
# Removing extra columns that do not belong to the original dataset, as we have created them
import pandas as pd

df_original_features = pd.read_csv("outputs/datasets/collection/HousePricesRecords.csv")

# Identify columns in df that are also in df_original
common_columns = df.columns.intersection(df_original_features.columns)

# Filter df to only include those common columns
df = df[common_columns]

df
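
As a small illustration of the column filter above, the sketch below uses two made-up frames (`original` and `working` are hypothetical names). It shows how `Index.intersection` drops a helper column such as `Consistency` while keeping the columns that exist in the original data.

```python
import pandas as pd

# Hypothetical stand-in for the originally collected dataset (column names only)
original = pd.DataFrame(columns=["GarageArea", "GarageFinish", "GarageYrBlt"])

# Hypothetical working frame with one helper column added during cleaning
working = pd.DataFrame({
    "GarageArea": [420],
    "Consistency": [True],
    "GarageYrBlt": [1998],
})

common_columns = working.columns.intersection(original.columns)
cleaned = working[common_columns]
print(list(cleaned.columns))
```

When the helper column names are known in advance, `working.drop(columns=["Consistency"])` is a more direct way to get the same result.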

## Saving current dataset

We will save current dataset as inputs/datasets/cleaning/garages_and_build_years.parquet.gzip

In [None]:
df.to_parquet('inputs/datasets/cleaning/garages_and_build_years.parquet.gzip', compression='gzip')

### Adding Cleaning code to pipeline
```python
# Direct assignments to fill missing values and convert data types for garage-related columns
df['GarageArea'] = df['GarageArea'].fillna(0)
df['GarageYrBlt'] = df['GarageYrBlt'].fillna(0).astype(int)
df['GarageFinish'] = df['GarageFinish'].fillna('None')

# Define a dictionary for checking consistency based on 'GarageFinish'
features_and_values = {"GarageArea": 0, "GarageFinish": 'None', "GarageYrBlt": 0}

def check_consistency(df, primary_feature):
    primary_value = features_and_values[primary_feature]
    return df.apply(
        lambda row: all(row[feature] == value for feature, value in features_and_values.items()) 
        if row[primary_feature] == primary_value else True, axis=1
    )

# Apply consistency check and correct 'GarageFinish'
consistency_mask = check_consistency(df, 'GarageFinish')
df.loc[~consistency_mask, 'GarageFinish'] = 'Unf'

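# Correct garage years that are earlier than the house build year,
# only where a garage exists (rows without a garage keep GarageYrBlt = 0)
has_garage = (df['GarageFinish'] != 'None') | (df['GarageArea'] != 0)
df.loc[has_garage & (df['GarageYrBlt'] < df['YearBuilt']), 'GarageYrBlt'] = df['YearBuilt']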
```

## Next step is cleaning Kitchen Quality