# Notebook 03 - Basement features: data cleaning and fixing

## Objectives
* Clean data
* Evaluate and process missing data
* Fix potential issues with data in given features:
    * BsmtExposure - Refers to walkouts or garden level walls
    * BsmtFinType1 - Rating of basement finished area
    * BsmtFinSF1 - Type 1 finished square feet (we believe it is finished basement area)
    * BsmtUnfSF - Unfinished square feet of basement area
    * TotalBsmtSF - Total square feet of basement area

## Inputs
* inputs/datasets/cleaning/bedrooms.csv

## Outputs
* Clean and fix (missing and potentially wrong) data in given features
* After cleaning is completed, we will save the current dataset in inputs/datasets/cleaning/basement.csv

## Change working directory
In this section we get the location of the current directory and move up to its parent folder, so that the app can access the project folder.

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/Users/pecukevicius/DataspellProjects/heritage_houses_p5/jupyter_notebooks/data_cleaning'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("you have set a new current directory")

you have set a new current directory


Confirm new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/Users/pecukevicius/DataspellProjects/heritage_houses_p5/jupyter_notebooks'

We need to check current working directory

In [4]:
current_dir

'/Users/pecukevicius/DataspellProjects/heritage_houses_p5/jupyter_notebooks'

The current directory is **jupyter_notebooks**, because this notebook lives in a subfolder. We go one more step up to the parent directory, which is the project's main directory, and print it to confirm.

In [5]:
os.chdir(os.path.dirname(current_dir))
current_dir = os.getcwd()
current_dir

'/Users/pecukevicius/DataspellProjects/heritage_houses_p5'
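
The two `os.chdir` calls above can also be expressed as a single step. The sketch below is a hedged alternative using `pathlib`; the helper name `nth_parent` and the example path are hypothetical and not part of the notebook. It only computes the target path, so the `os.chdir` call itself stays explicit.

```python
from pathlib import Path

def nth_parent(path, levels):
    """Return the directory `levels` steps above `path` (hypothetical helper, levels >= 1)."""
    return Path(path).parents[levels - 1]

notebook_dir = Path("/project/jupyter_notebooks/data_cleaning")  # illustrative path
project_root = nth_parent(notebook_dir, 2)
print(project_root)
# To actually switch directory: os.chdir(nth_parent(os.getcwd(), 2))
```

Computing the path first makes it easy to print and verify the target before changing directory.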

## Loading Dataset

In [6]:
import pandas as pd

df = pd.read_csv("inputs/datasets/cleaning/bedrooms.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,EnclosedPorch,GarageArea,...,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd,SalePrice
0,0,856,854,3,No,706,GLQ,150,0.0,548,...,65.0,196.0,61,5,7,856,0.0,2003,2003,208500
1,1,1262,0,3,Gd,978,ALQ,284,,460,...,80.0,0.0,0,8,6,1262,,1976,1976,181500
2,2,920,866,3,Mn,486,GLQ,434,0.0,608,...,68.0,162.0,42,5,7,920,,2001,2002,223500
3,3,961,0,2,No,216,ALQ,540,,642,...,60.0,0.0,35,5,7,756,,1915,1970,140000
4,4,1145,0,4,Av,655,GLQ,490,0.0,836,...,84.0,350.0,84,5,8,1145,,2000,2000,250000


## Exploring Data

We will list the features that have missing data. First, we check the data types of the given features.

In [7]:
columns_of_interest = ['BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtUnfSF', 'TotalBsmtSF']
column_types = df[columns_of_interest].dtypes

# Display the data types of these columns
print("Data types of the specified columns:")
print(column_types)

Data types of the specified columns:
BsmtExposure    object
BsmtFinType1    object
BsmtFinSF1       int64
BsmtUnfSF        int64
TotalBsmtSF      int64
dtype: object


### Checking for missing values in the given columns

In [8]:
# Count missing values in each of the basement columns
missing_features = df[columns_of_interest].isnull().sum()

# Display the number of missing values per column (before any filling)
print("Missing values per column:")
print(missing_features)

Missing values per column:
BsmtExposure     38
BsmtFinType1    145
BsmtFinSF1        0
BsmtUnfSF         0
TotalBsmtSF       0
dtype: int64


We can see that two features have missing values: BsmtExposure and BsmtFinType1.

We will fill the missing values with the string 'None', since both columns are of object type (in our dataset, None means no basement). We will inspect later whether all values are correct.

In [9]:
df['BsmtExposure'] = df['BsmtExposure'].fillna('None')
df['BsmtFinType1'] = df['BsmtFinType1'].fillna('None')

For ease of use in the code, we define the current features as a list.

In [10]:
# Define a list of basement-related features
basement_features = ['BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtUnfSF', 'TotalBsmtSF']

### Basement Consistency checking

Since we filled the missing values of BsmtExposure and BsmtFinType1 with None, we need to explore how consistent the data is across all basement features.

We will create a function that takes a given feature and, wherever it indicates no basement, checks whether the remaining features agree:

In [11]:
def check_consistency(df, primary_feature):
    """
    Checks consistency of a primary feature against a set of expected values for related features.
    
    Parameters:
        df (pd.DataFrame): The DataFrame containing the data.
        primary_feature (str): The primary feature to be checked.
    
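    Returns:
        pd.DataFrame: The rows where the primary feature holds its 'no presence' value
            but at least one other feature does not. Returns None (after printing a
            message) if the primary feature is unknown. A helper column 'Consistency'
            is added to df as a side effect.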
    """
    # Directly define features and their values indicating 'no presence' in a dictionary
    features_and_values = {
        "BsmtExposure": "None",
        "BsmtFinType1": "None",
        "BsmtFinSF1": 0,
        "BsmtUnfSF": 0,
        "TotalBsmtSF": 0
    }

    # Ensure primary feature is valid
    if primary_feature not in features_and_values:
        print(f"Feature {primary_feature} not defined in feature settings.")
        return

    # Determine the primary value to check against
    primary_value = features_and_values[primary_feature]

    # Check each feature against the primary feature's condition
    df['Consistency'] = df.apply(
        lambda row: True if row[primary_feature] != primary_value else all(
            row[feature] == value for feature, value in features_and_values.items() if feature != primary_feature
        ), axis=1
    )

    # Filter and return inconsistent records
    inconsistent_records = df[~df['Consistency']]
    return inconsistent_records
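
The same rule can be written without `df.apply` and without adding a helper column. The following is a minimal vectorized sketch on toy data, not the notebook's implementation; the names `NO_BASEMENT` and `find_inconsistent` are hypothetical.

```python
import pandas as pd

# Values that mean "no basement" for each feature (copied from the function above)
NO_BASEMENT = {
    "BsmtExposure": "None",
    "BsmtFinType1": "None",
    "BsmtFinSF1": 0,
    "BsmtUnfSF": 0,
    "TotalBsmtSF": 0,
}

def find_inconsistent(df, primary_feature):
    """Rows where primary_feature says 'no basement' but some other feature disagrees."""
    primary_absent = df[primary_feature] == NO_BASEMENT[primary_feature]
    others = [f for f in NO_BASEMENT if f != primary_feature]
    # True where every other feature also holds its 'no basement' value
    others_absent = pd.concat(
        [df[f] == NO_BASEMENT[f] for f in others], axis=1
    ).all(axis=1)
    return df[primary_absent & ~others_absent]

toy = pd.DataFrame({
    "BsmtExposure": ["None", "None", "No"],
    "BsmtFinType1": ["None", "Unf", "GLQ"],
    "BsmtFinSF1": [0, 0, 500],
    "BsmtUnfSF": [0, 936, 100],
    "TotalBsmtSF": [0, 936, 600],
})
flagged = find_inconsistent(toy, "BsmtExposure")
print(flagged.index.tolist())
```

Only the second toy row is flagged: it claims no exposure while the other columns describe a 936 square foot unfinished basement.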

We loop through each feature and print the results

In [12]:
def loop_check_consistency(df, basement_features):
    for feature in basement_features:
        errors = check_consistency(df, feature)
        error_count = errors.shape[0]  # Get the number of rows in the errors DataFrame
        print(f"Feature {feature} has {error_count} inconsistent rows.")

# Run the loop check consistency function
loop_check_consistency(df, basement_features)

Feature BsmtExposure has 1 inconsistent rows.
Feature BsmtFinType1 has 108 inconsistent rows.
Feature BsmtFinSF1 has 430 inconsistent rows.
Feature BsmtUnfSF has 81 inconsistent rows.
Feature TotalBsmtSF has 0 inconsistent rows.


## Consistency inspection and fixing

The dataset shows many inconsistencies, so we will address each feature separately.

### BsmtExposure Inconsistency

In [13]:
BsmtExposure = check_consistency(df, 'BsmtExposure')
BsmtExposure[basement_features]

Unnamed: 0,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtUnfSF,TotalBsmtSF
948,,Unf,0,936,936


If the other columns record any basement area or finish type, the house does have a basement, and the None value is a data-entry mistake.

We will replace the wrong value (None) with the most frequent value of the feature across the whole dataset.

In [14]:
mode_value = df['BsmtExposure'].mode()[0]  # mode() returns a Series; [0] accesses the first mode
df.loc[BsmtExposure.index, 'BsmtExposure'] = mode_value
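
One caveat with this replacement: `mode()` is computed over the whole column, placeholder included, so if 'None' were ever the most frequent value, the fix would write 'None' straight back. A minimal sketch that guards against this, on made-up data (the helper name `mode_excluding` is hypothetical and not part of the notebook):

```python
import pandas as pd

def mode_excluding(series, placeholder="None"):
    """Most frequent value of `series`, ignoring the placeholder category."""
    return series[series != placeholder].mode()[0]

exposure = pd.Series(["None", "None", "None", "No", "No", "Gd"])
plain_mode = exposure.mode()[0]          # placeholder wins on this toy data
guarded_mode = mode_excluding(exposure)  # most frequent real category
print(plain_mode, guarded_mode)
```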

Now we will check BsmtExposure again for any inconsistencies

In [15]:
BsmtExposure = check_consistency(df, 'BsmtExposure')
BsmtExposure[basement_features]

Unnamed: 0,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtUnfSF,TotalBsmtSF


There are no remaining mistakes in BsmtExposure.

### BsmtFinType1 Inconsistency

In [16]:
BsmtFinType1 = check_consistency(df, 'BsmtFinType1')
BsmtFinType1[basement_features]

Unnamed: 0,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtUnfSF,TotalBsmtSF
11,No,,998,177,1175
22,No,,0,1777,1777
26,Mn,,234,180,900
55,No,,490,935,1425
89,No,,588,402,990
...,...,...,...,...,...
1415,No,,988,398,1386
1423,Av,,0,697,697
1435,No,,0,1319,1319
1446,No,,593,595,1188


We have 108 inconsistent records, and they are likely the same kind of mistake as with BsmtExposure.
We will fix them the same way.

In [17]:
mode_value = df['BsmtFinType1'].mode()[0]  # mode() returns a Series; [0] accesses the first mode
df.loc[BsmtFinType1.index, 'BsmtFinType1'] = mode_value

Checking again for any mistakes in BsmtFinType1

In [18]:
BsmtFinType1 = check_consistency(df, 'BsmtFinType1')
BsmtFinType1[basement_features]

Unnamed: 0,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtUnfSF,TotalBsmtSF


We can see there are no mistakes at the moment.

### BsmtFinSF1, BsmtUnfSF, TotalBsmtSF Inconsistency

These features represent the finished (Type 1), unfinished and total basement areas.

The previous cells showed that whenever a house has a basement, BsmtExposure and BsmtFinType1 record it, so they cannot be None.

In these columns a value of 0 can be correct; for example, BsmtUnfSF is 0 when the whole basement is finished.
Before we proceed, we need to check whether the areas add up:
BsmtFinSF1 + BsmtUnfSF = TotalBsmtSF

In [19]:
df['BsmtAreaCheck'] = (df['BsmtFinSF1'] + df['BsmtUnfSF'] == df['TotalBsmtSF'])
inconsistencies = df['BsmtAreaCheck'].value_counts().get(False, 0)
inconsistencies

167

We can see that there are 167 records where the areas do not add up, so we will apply the following corrections:
* if BsmtUnfSF == 0, we will replace it with TotalBsmtSF - BsmtFinSF1
* if BsmtFinSF1 == 0, we will replace it with TotalBsmtSF - BsmtUnfSF
* if TotalBsmtSF == 0, we will replace it with BsmtFinSF1 + BsmtUnfSF

After that we will check for inconsistencies again.

We will also add part of this code to the cleaning pipeline.

In [20]:
import pandas as pd

# Correcting BsmtUnfSF when it is erroneously zero
df.loc[(df['BsmtUnfSF'] == 0), 'BsmtUnfSF'] = df['TotalBsmtSF'] - df['BsmtFinSF1']

# Correcting BsmtFinSF1 when it is erroneously zero
df.loc[(df['BsmtFinSF1'] == 0), 'BsmtFinSF1'] = df['TotalBsmtSF'] - df['BsmtUnfSF']

# Correcting TotalBsmtSF when it is erroneously zero
df.loc[(df['TotalBsmtSF'] == 0), 'TotalBsmtSF'] = df['BsmtUnfSF'] + df['BsmtFinSF1']

# Adding a consistency check to verify corrections
df['BsmtAreaCheck'] = (df['BsmtFinSF1'] + df['BsmtUnfSF'] == df['TotalBsmtSF'])

# Counting and displaying inconsistencies
inconsistencies = df['BsmtAreaCheck'].value_counts().get(False, 0)
print(f"Number of inconsistencies after corrections: {inconsistencies}")

df[df['BsmtAreaCheck'] == False][['BsmtFinSF1', 'BsmtUnfSF', 'TotalBsmtSF', 'BsmtAreaCheck']]



Number of inconsistencies after corrections: 129


Unnamed: 0,BsmtFinSF1,BsmtUnfSF,TotalBsmtSF,BsmtAreaCheck
7,859,216,1107,False
24,188,204,1060,False
26,234,180,900,False
43,280,167,938,False
44,179,465,1150,False
...,...,...,...,...
1418,25,247,1144,False
1424,457,193,1024,False
1439,315,114,539,False
1456,790,589,1542,False
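
To see what the three zero-replacement rules do and do not repair, here is a small sketch on made-up rows (the values are illustrative, except the last row, which mirrors record 26 above). A row with no zero in any of the three columns is left untouched, which is why inconsistencies remain after this step.

```python
import pandas as pd

toy = pd.DataFrame({
    "BsmtFinSF1":  [0,   300, 200, 234],
    "BsmtUnfSF":   [500, 0,   100, 180],
    "TotalBsmtSF": [800, 700, 0,   900],
})

# Same three rules as in the cell above, applied in the same order
toy.loc[toy["BsmtUnfSF"] == 0, "BsmtUnfSF"] = toy["TotalBsmtSF"] - toy["BsmtFinSF1"]
toy.loc[toy["BsmtFinSF1"] == 0, "BsmtFinSF1"] = toy["TotalBsmtSF"] - toy["BsmtUnfSF"]
toy.loc[toy["TotalBsmtSF"] == 0, "TotalBsmtSF"] = toy["BsmtFinSF1"] + toy["BsmtUnfSF"]

# Check which rows now satisfy BsmtFinSF1 + BsmtUnfSF == TotalBsmtSF
adds_up = toy["BsmtFinSF1"] + toy["BsmtUnfSF"] == toy["TotalBsmtSF"]
print(toy.assign(AddsUp=adds_up))
```

The first three rows each contain one zero and are repaired; the last row (234 + 180 = 414, not 900) has no zero to replace and stays inconsistent.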


We still have discrepancies where the finished and unfinished areas do not add up to the total area.

To go further, we will compute the average ratio of finished to unfinished basement area over the dataset, using only records where the two areas sum to the total.

In [21]:
# Step 1: Filter for consistent records
consistent_records = df[df['BsmtFinSF1'] + df['BsmtUnfSF'] == df['TotalBsmtSF']]

# Step 2: Calculate the ratio of finished to unfinished areas
# Avoid division by zero by ensuring 'BsmtUnfSF' is not zero;
# .copy() avoids a SettingWithCopyWarning when the new column is added
consistent_records = consistent_records[consistent_records['BsmtUnfSF'] != 0].copy()
consistent_records['Fin_Unf_Ratio'] = consistent_records['BsmtFinSF1'] / consistent_records['BsmtUnfSF']

# Step 3: Compute the overall average ratio
# This will give us the mean ratio of finished to unfinished basement areas
overall_ratio = consistent_records['Fin_Unf_Ratio'].mean()
overall_ratio

2.0626378723346837
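
Note that the cell above averages per-record ratios, which is not the same as dividing the total finished area by the total unfinished area: records with a small BsmtUnfSF produce very large ratios and pull the mean upward. A small sketch on made-up numbers shows the difference. Which of the two is more appropriate here is a judgment call; the notebook uses the first.

```python
import pandas as pd

toy = pd.DataFrame({
    "BsmtFinSF1": [900, 100],
    "BsmtUnfSF":  [100, 900],
})

# Mean of per-record ratios (the approach used in the notebook)
mean_of_ratios = (toy["BsmtFinSF1"] / toy["BsmtUnfSF"]).mean()

# Ratio of column totals (an alternative, shown only for comparison)
ratio_of_sums = toy["BsmtFinSF1"].sum() / toy["BsmtUnfSF"].sum()

print(mean_of_ratios, ratio_of_sums)
```

On this toy data the two columns hold the same total area, yet the mean of ratios is well above 1 because the 900/100 record dominates.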

We can now redistribute the finished and unfinished areas using this ratio, so that their sum adds up to the total.

In [22]:
# Step 1: Filter for inconsistent records
inconsistent_records = df[df['BsmtFinSF1'] + df['BsmtUnfSF'] != df['TotalBsmtSF']]

# Step 2: Adjust using the overall ratio
for index, row in inconsistent_records.iterrows():
    total_bsmt_sf = row['TotalBsmtSF']
    # Calculate new values based on the overall ratio
    new_unf_sf = int(total_bsmt_sf / (overall_ratio + 1))
    new_fin_sf = total_bsmt_sf - new_unf_sf               

    # Assign the new values back to the DataFrame
    df.at[index, 'BsmtUnfSF'] = new_unf_sf
    df.at[index, 'BsmtFinSF1'] = new_fin_sf

# Step 3: Re-check consistency
df['ConsistencyCheck'] = (df['BsmtFinSF1'] + df['BsmtUnfSF'] == df['TotalBsmtSF'])
inconsistencies_after_adjustment = df['ConsistencyCheck'].value_counts().get(False, 0)
print(f"Number of inconsistencies after adjustments: {inconsistencies_after_adjustment}")


Number of inconsistencies after adjustments: 0
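
The row-by-row loop above applies one formula: the unfinished area is TotalBsmtSF / (ratio + 1), rounded down, and the finished area is TotalBsmtSF minus that. A vectorized sketch of the same arithmetic follows. The helper name `split_total` and the ratio 2.0 are illustrative, and `np.floor` matches `int()` only for non-negative totals, which is assumed here.

```python
import numpy as np
import pandas as pd

def split_total(total, ratio):
    """Split total area into (finished, unfinished) with finished/unfinished close to `ratio`."""
    unfinished = np.floor(total / (ratio + 1)).astype(int)
    finished = total - unfinished
    return finished, unfinished

totals = pd.Series([900, 0, 1001])
fin, unf = split_total(totals, ratio=2.0)  # 2.0 is an illustrative ratio
print(fin.tolist(), unf.tolist())
```

Because the finished area is derived by subtraction, the two parts always sum to the total exactly, whatever the rounding. The recorded finished and unfinished values of the adjusted rows are overwritten in the process.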


We have completed the basement data fixing. All that remains is to check that all values are correct, add up and are consistent.

In [23]:
loop_check_consistency(df, basement_features)
df['BsmtAreaCheck'] = (df['BsmtFinSF1'] + df['BsmtUnfSF'] == df['TotalBsmtSF'])

inconsistencies = df['BsmtAreaCheck'].value_counts().get(False, 0)
print()
print("Total number where areas of basement do not add up: ", inconsistencies)

Feature BsmtExposure has 0 inconsistent rows.
Feature BsmtFinType1 has 0 inconsistent rows.
Feature BsmtFinSF1 has 430 inconsistent rows.
Feature BsmtUnfSF has 43 inconsistent rows.
Feature TotalBsmtSF has 0 inconsistent rows.

Total number where areas of basement do not add up:  0


We can ignore the following lines:
* Feature BsmtFinSF1 has 430 inconsistent rows.
* Feature BsmtUnfSF has 43 inconsistent rows.
* Feature TotalBsmtSF has 0 inconsistent rows.

These checks were only needed for BsmtExposure and BsmtFinType1. The area features can legitimately be zero, depending on how much of the basement is finished.

## No remaining issues in the given features

Before saving the dataset, we will:
1. Remove columns that do not belong to the given dataset
2. Encode columns BsmtExposure and BsmtFinType1 as numbers:
    * create a file for managing all encodings, encoders.py
    * create the file all_encodings.json
    * encode the given features
    * save the encoding dictionaries in all_encodings.json

Then we will export the dataset as inputs/datasets/cleaning/basement.csv

In [24]:
# Remove extra columns that do not belong to the original dataset (helper columns created above)
df_original_features = pd.read_csv("outputs/datasets/collection/HousePricesRecords.csv")

# Identify columns in df that are also in df_original_features
common_columns = df.columns.intersection(df_original_features.columns)

# Filter df to only include those common columns
df = df[common_columns]

df.head()


Unnamed: 0.1,Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,EnclosedPorch,GarageArea,...,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd,SalePrice
0,0,856,854,3,No,706,GLQ,150,0.0,548,...,65.0,196.0,61,5,7,856,0.0,2003,2003,208500
1,1,1262,0,3,Gd,978,ALQ,284,,460,...,80.0,0.0,0,8,6,1262,,1976,1976,181500
2,2,920,866,3,Mn,486,GLQ,434,0.0,608,...,68.0,162.0,42,5,7,920,,2001,2002,223500
3,3,961,0,2,No,216,ALQ,540,,642,...,60.0,0.0,35,5,7,756,,1915,1970,140000
4,4,1145,0,4,Av,655,GLQ,490,0.0,836,...,84.0,350.0,84,5,8,1145,,2000,2000,250000
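
An alternative to reloading the original file is to drop the helper columns by name. This is only a sketch: it assumes the three helper columns created in this notebook ('Consistency', 'BsmtAreaCheck', 'ConsistencyCheck') are the only extras, and it passes `errors="ignore"` so that it is safe to run when some of them are absent.

```python
import pandas as pd

# Helper columns added during the checks in this notebook
HELPER_COLUMNS = ["Consistency", "BsmtAreaCheck", "ConsistencyCheck"]

toy = pd.DataFrame({
    "TotalBsmtSF": [856, 1262],
    "BsmtAreaCheck": [True, True],
    "Consistency": [True, True],
})

# 'ConsistencyCheck' is missing from toy; errors="ignore" skips it silently
cleaned = toy.drop(columns=HELPER_COLUMNS, errors="ignore")
print(cleaned.columns.tolist())
```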


### Encoding BsmtExposure as numbers

In [25]:
# Encoding the given features as numbers reduces the dataset size, which could speed up later calculations on the dataset
from sklearn.preprocessing import LabelEncoder
import joblib

# Creating an instance of LabelEncoder
label_encoder = LabelEncoder()

# Fitting and transforming the column to encode
df['BsmtExposure'] = label_encoder.fit_transform(df['BsmtExposure'])

# Showing the mapping
mapping = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))
print("Current encoding: ", mapping)

# saving encoder settings
joblib.dump(label_encoder, 'models/joblib/bsmt_exposure_encoder.joblib')


Current encoding:  {'Av': 0, 'Gd': 1, 'Mn': 2, 'No': 3, 'None': 4}


['models/joblib/bsmt_exposure_encoder.joblib']

### Encoding BsmtFinType1 as numbers

In [26]:
# Encoding the given features as numbers reduces the dataset size, which could speed up later calculations on the dataset
from sklearn.preprocessing import LabelEncoder

# Creating an instance of LabelEncoder
label_encoder = LabelEncoder()

# Fitting and transforming the column to encode
df['BsmtFinType1'] = label_encoder.fit_transform(df['BsmtFinType1'])

# Showing the mapping
mapping = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))
print("Current encoding: ", mapping)

joblib.dump(label_encoder, 'models/joblib/bsm_fin_type_1_encoder.joblib')

Current encoding:  {'ALQ': 0, 'BLQ': 1, 'GLQ': 2, 'LwQ': 3, 'None': 4, 'Rec': 5, 'Unf': 6}


['models/joblib/bsm_fin_type_1_encoder.joblib']
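
LabelEncoder assigns codes in alphabetical order of the class labels, as the mappings printed above show (for example 'Av': 0, 'Gd': 1, 'Mn': 2, 'No': 3, 'None': 4), so the numbers carry no notion of more or less exposure. If an ordered encoding is wanted, an explicit mapping is one option. The sketch below is an assumption, not what this notebook does: the order None < No < Mn < Av < Gd reflects our reading of the category meanings and should be checked against the data description.

```python
import pandas as pd

# Assumed ordering from no basement up to good exposure (hypothetical, not from the notebook)
EXPOSURE_ORDER = {"None": 0, "No": 1, "Mn": 2, "Av": 3, "Gd": 4}

exposure = pd.Series(["No", "Gd", "Mn", "None", "Av"])
encoded = exposure.map(EXPOSURE_ORDER)
print(encoded.tolist())
```

A plain dictionary like this can also be stored as JSON, which keeps the encoding readable outside Python.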

## Saving current dataset

We will save the current dataset as inputs/datasets/cleaning/basement.csv

In [27]:
df.to_csv('inputs/datasets/cleaning/basement.csv', index=False)

### Adding the basement cleaning code to the pipeline

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
import joblib

def fill_missing_values(df, column_name, method='mode'):
    """
    Fills missing values in a DataFrame column using the specified method.
    
    Args:
    df (DataFrame): The DataFrame containing the column.
    column_name (str): The column to fill missing values in.
    method (str): Method to use for filling, defaults to 'mode'.
    
    Returns:
    DataFrame: The DataFrame with filled values.
    """
    if method == 'mode':
        mode_value = df[column_name].mode()[0]
        df.loc[df[column_name].isnull(), column_name] = mode_value
    return df

def correct_basement_area(df):
    """
    Corrects basement area calculations where any of the basement area fields might be zero.
    
    Args:
    df (DataFrame): The DataFrame to correct.
    
    Returns:
    DataFrame: The corrected DataFrame.
    """
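    # Each rule is evaluated and applied in turn, matching the notebook cell above
    unf_zero = df['BsmtUnfSF'] == 0
    df.loc[unf_zero, 'BsmtUnfSF'] = df['TotalBsmtSF'] - df['BsmtFinSF1']

    fin_zero = df['BsmtFinSF1'] == 0
    df.loc[fin_zero, 'BsmtFinSF1'] = df['TotalBsmtSF'] - df['BsmtUnfSF']

    # A missing total is the sum of the two parts, not a difference
    total_zero = df['TotalBsmtSF'] == 0
    df.loc[total_zero, 'TotalBsmtSF'] = df['BsmtFinSF1'] + df['BsmtUnfSF']

    return df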

def calculate_fin_unf_ratio(df):
    """
    Calculates the ratio of finished to unfinished basement areas.
    
    Args:
    df (DataFrame): The DataFrame to calculate ratios in.
    
    Returns:
    DataFrame: The DataFrame with a new column 'Fin_Unf_Ratio' for the ratio.
    """
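    # Drop rows with zero unfinished area to avoid division by zero;
    # .copy() ensures the new column is not set on a slice of the input
    df = df[df['BsmtUnfSF'] != 0].copy()
    df['Fin_Unf_Ratio'] = df['BsmtFinSF1'] / df['BsmtUnfSF']
    return df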

def encode_features(df, column_names):
    """
    Encodes categorical features using LabelEncoder and saves the encoders.
    
    Args:
    df (DataFrame): The DataFrame containing the features to encode.
    column_names (list): List of column names to encode.
    
    Returns:
    DataFrame: The DataFrame with encoded features.
    """
    for column in column_names:
        le = LabelEncoder()
        df[column] = le.fit_transform(df[column])
        joblib.dump(le, f'models/joblib/{column}_encoder.joblib')
        print(f"Encoding for {column}: ", dict(zip(le.classes_, le.transform(le.classes_))))
    return df

# Main operations
columns_of_interest = ['BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtUnfSF', 'TotalBsmtSF']
df = fill_missing_values(df, 'BsmtExposure')
df = fill_missing_values(df, 'BsmtFinType1')
df = correct_basement_area(df)

# Data type inspection
column_types = df[columns_of_interest].dtypes
print("Data types of the specified columns:")
print(column_types)

# Consistency checks
df['BsmtAreaCheck'] = (df['BsmtFinSF1'] + df['BsmtUnfSF'] == df['TotalBsmtSF'])
inconsistencies = df['BsmtAreaCheck'].value_counts().get(False, 0)
print(f"Number of inconsistencies: {inconsistencies}")

# Encode features
df = encode_features(df, ['BsmtExposure', 'BsmtFinType1'])
```

## Next step: cleaning and fixing garage data