# Notebook 02 - Bedrooms data cleaning and fixing

## Objectives
* Clean data
* Evaluate and process missing data
* Fix potential issues with data in feature (BedroomAbvGr)

## Inputs
* inputs/datasets/cleaning/floors.parquet.gzip

## Outputs
* Clean and fix (missing and potentially wrong) data in given column
* After cleaning is completed, we will save current dataset in inputs/datasets/cleaning/bedrooms.parquet.gzip

## Change working directory
In this section we get the location of the current directory and move one step up, to the parent folder, so that the app can access the project folder.

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [None]:
import os

current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("you have set a new current directory")

Confirm new current directory

In [None]:
current_dir = os.getcwd()
current_dir
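As a minimal sketch of what `os.path.dirname()` does here, the snippet below applies it to a made-up path (the path `/workspace/project/jupyter_notebooks` is an assumption for illustration, not the real project location). It also shows the equivalent `pathlib` call.

```python
import os
from pathlib import PurePosixPath

# Hypothetical notebook location, used only for illustration
example_dir = "/workspace/project/jupyter_notebooks"

# dirname() strips the last path component, giving the parent folder
parent_dir = os.path.dirname(example_dir)
print(parent_dir)

# The same result using pathlib
parent_via_pathlib = str(PurePosixPath(example_dir).parent)
print(parent_via_pathlib)
```

Either form returns the parent without changing the working directory; only `os.chdir()` actually moves the process.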

We need to check current working directory

In [None]:
current_dir

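If the output above still shows **jupyter_notebooks**, the notebook is running from the subfolder, so we go one step up to the parent directory, which is our project's main directory. The check below avoids moving up a second time if the directory was already changed.
Print out to confirm the working directory

In [None]:
# Move up only if we are still inside the notebooks subfolder
if os.path.basename(current_dir) == "jupyter_notebooks":
    os.chdir(os.path.dirname(current_dir))
current_dir = os.getcwd()
current_dir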

## Loading Dataset

In [None]:
import pandas as pd

df = pd.read_parquet("inputs/datasets/cleaning/floors.parquet.gzip")
df.head()

## Exploring Data

We check for missing data; any missing values will be replaced with 0

In [None]:
df['BedroomAbvGr'] = df['BedroomAbvGr'].fillna(0)

We need to convert it to integer

In [None]:
df['BedroomAbvGr'] = df['BedroomAbvGr'].astype('int')
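Before filling, it can help to see how many values are actually missing. The sketch below uses a small made-up DataFrame (`toy` is an assumption, not the project data) to count missing values and then apply the same fill and conversion steps.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real column
toy = pd.DataFrame({"BedroomAbvGr": [3.0, np.nan, 2.0, np.nan]})

# Count missing values before they are overwritten
n_missing = int(toy["BedroomAbvGr"].isna().sum())
print(f"Missing values before filling: {n_missing}")

# Fill first, then convert: casting NaN directly to int raises an error
toy["BedroomAbvGr"] = toy["BedroomAbvGr"].fillna(0).astype(int)
print(toy["BedroomAbvGr"].tolist())
```

Note that after this step a filled 0 can no longer be told apart from a genuine 0 in the data.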

Check whether there are any buildings where the number of bedrooms is 0, as it is very unlikely that a house is built with no bedrooms


In [None]:
issues_bedrooms = df[df['BedroomAbvGr'] == 0]
issues_bedrooms

We have 105 records where the building has no bedrooms (this count includes records whose missing value was replaced with 0 above).

Before we proceed, we expect:
1. Every house with NO 2nd floor to have at least 1 bedroom
2. Every house with a 2nd floor to have at least 2 bedrooms

Based on these expectations we will:
* get the mean number of bedrooms for all houses with NO 2nd floor
* get the mean number of bedrooms for all houses with a 2nd floor
* get the mean number of bedrooms for all houses (just to have a basic picture)

In [None]:
print("Mean of bedrooms in houses with NO 2nd floor is:", df.loc[df['2ndFlrSF'] == 0, 'BedroomAbvGr'].mean())
print("Mean of bedrooms in houses with 2nd floor is:", df.loc[df['2ndFlrSF'] > 0, 'BedroomAbvGr'].mean())
print("Mean of bedrooms in all houses is:", df['BedroomAbvGr'].mean())


We can see that, on average, houses have at least 2 bedrooms; if there is a 2nd floor, the average is about 3
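The same per-group means can be obtained in one pass with `groupby`. This is a sketch on made-up numbers (the `toy` frame and the group labels are assumptions), not the notebook's actual output.

```python
import numpy as np
import pandas as pd

# Toy data: two houses without a 2nd floor, two with
toy = pd.DataFrame({
    "2ndFlrSF": [0, 0, 800, 600],
    "BedroomAbvGr": [2, 3, 3, 4],
})

# Label each row by whether it has a 2nd floor
floor_group = np.where(toy["2ndFlrSF"] > 0, "with 2nd floor", "no 2nd floor")

# Mean bedrooms per group, plus the overall mean
group_means = toy.groupby(floor_group)["BedroomAbvGr"].mean()
overall_mean = toy["BedroomAbvGr"].mean()
print(group_means)
print("All houses:", overall_mean)
```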

Let's see how bedrooms are distributed across buildings (we compare the number of bedrooms with GrLivArea, the total living area)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

df['Has_2nd_Floor'] = (df['2ndFlrSF'] > 0).astype(int)  # 1 if there's a second floor, 0 otherwise

plt.figure(figsize=(12, 8))
ax = sns.scatterplot(x='GrLivArea', y='BedroomAbvGr', hue='Has_2nd_Floor', data=df, palette={0: 'blue', 1: 'green'},
                     style='Has_2nd_Floor', markers=['o', 's'], alpha=0.6)
plt.title('Distribution of Bedrooms vs. Living Area by Presence of Second Floor')
plt.xlabel('Total Living Area (sq ft)')
plt.ylabel('Number of Bedrooms')
# Reuse seaborn's own legend handles so that 'No'/'Yes' map to the correct markers
handles, _ = ax.get_legend_handles_labels()
ax.legend(handles=handles, labels=['No', 'Yes'], title='Has Second Floor?')
plt.show()

This plot does not give many hints about how bedrooms might be distributed across houses based on living area.
No clusters are visible, as each bedroom count covers a wide range of living areas.

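Let's try the same plot, but this time grouping houses with K-Means (4 clusters per group). Since the features are not scaled, GrLivArea dominates the distance, so the clusters act much like living-area bins.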

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Splitting dataframe based on whether the house has bedrooms or not
df_with_bedrooms = df[df['BedroomAbvGr'] > 0].copy()

# Split data based on the presence of a second floor
df_no_2nd_floor = df_with_bedrooms[df_with_bedrooms['2ndFlrSF'] == 0].copy()
df_with_2nd_floor = df_with_bedrooms[df_with_bedrooms['2ndFlrSF'] > 0].copy()

# Fit K-Means for houses without a second floor
kmeans_no_2nd_floor = KMeans(n_clusters=4, random_state=0).fit(df_no_2nd_floor[['GrLivArea', 'BedroomAbvGr']])
df_no_2nd_floor['Cluster'] = kmeans_no_2nd_floor.labels_

# Fit K-Means for houses with a second floor
kmeans_with_2nd_floor = KMeans(n_clusters=4, random_state=0).fit(df_with_2nd_floor[['GrLivArea', 'BedroomAbvGr']])
df_with_2nd_floor['Cluster'] = kmeans_with_2nd_floor.labels_

# Plotting houses without a second floor
plt.figure(figsize=(12, 6))
sns.scatterplot(data=df_no_2nd_floor, x='GrLivArea', y='BedroomAbvGr', hue='Cluster', palette='viridis', s=100,
                marker='o')
plt.title('Clustered Bedrooms vs. Total Living Area (No Second Floor)')
plt.xlabel('Total Living Area (sq ft)')
plt.ylabel('Number of Bedrooms')
plt.grid(True)
plt.legend(title='Cluster')
plt.show()

# Plotting houses with a second floor
plt.figure(figsize=(12, 6))
sns.scatterplot(data=df_with_2nd_floor, x='GrLivArea', y='BedroomAbvGr', hue='Cluster', palette='viridis', s=100,
                marker='o')
plt.title('Clustered Bedrooms vs. Total Living Area (With Second Floor)')
plt.xlabel('Total Living Area (sq ft)')
plt.ylabel('Number of Bedrooms')
plt.grid(True)
plt.legend(title='Cluster')
plt.show()
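For comparison, explicit fixed-width bins of living area (every 1000 square feet) can be built with `pd.cut`. The sketch below uses made-up values; the `toy` frame, the `AreaBin` column name and the bin range are assumptions for illustration.

```python
import pandas as pd

# Toy data standing in for GrLivArea and BedroomAbvGr
toy = pd.DataFrame({
    "GrLivArea": [850, 1200, 1950, 2400, 3100],
    "BedroomAbvGr": [2, 3, 3, 4, 5],
})

# Bin edges every 1000 sq ft: [0, 1000), [1000, 2000), ...
bin_edges = list(range(0, 5000, 1000))
toy["AreaBin"] = pd.cut(toy["GrLivArea"], bins=bin_edges, right=False)

# Number of houses and mean bedrooms in each bin
summary = toy.groupby("AreaBin", observed=True)["BedroomAbvGr"].agg(["count", "mean"])
print(summary)
```

Unlike K-Means, the bin boundaries here are fixed in advance and do not depend on the data.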

## Clustering evaluation using Elbow, Silhouette Scores and Gap Statistic

We will evaluate the optimal number of clusters (if any) for the dataset. Common methods are the Elbow Method, Silhouette Scores and the Gap Statistic; the code in this section uses the Gap Statistic (`OptimalK` from the `gap_statistic` package).

These methods help to determine the most suitable number of clusters:
* **Elbow Method** - Identifies the point where the decrease in the within-cluster sum of squares (inertia) becomes less pronounced as the number of clusters grows
* **Silhouette Scores** - Measure how similar an object is to its own cluster compared to other clusters. A higher silhouette value indicates that the object is well matched to its own cluster and poorly matched to neighbouring clusters
* **Gap Statistic** - Compares the intra-cluster variation of the data against that of a reference distribution with no cluster structure
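As a rough illustration of the Elbow Method and Silhouette Scores, the sketch below runs both on synthetic data with three well-separated groups. The data from `make_blobs`, the range of `k` and all variable names are assumptions for illustration; this is not the notebook's dataset.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three clearly separated groups
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.5, random_state=0)

inertias, silhouettes = {}, {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_  # elbow: look for where the drop flattens
    if k > 1:  # the silhouette score needs at least 2 clusters
        silhouettes[k] = silhouette_score(X, model.labels_)

best_k = max(silhouettes, key=silhouettes.get)
print("Inertia per k:", inertias)
print("Best k by silhouette:", best_k)
```

The silhouette score is undefined for a single cluster, so it cannot by itself support a conclusion of "one cluster"; a method that compares against a reference distribution can.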

To proceed, we need to convert integers to float

In [None]:
df_with_bedrooms['GrLivArea'] = df_with_bedrooms['GrLivArea'].astype(float)
df_with_bedrooms['BedroomAbvGr'] = df_with_bedrooms['BedroomAbvGr'].astype(float)

In [None]:
import numpy as np
from gap_statistic import OptimalK

# Split data based on the presence of a second floor
df_no_2nd_floor = df_with_bedrooms[df_with_bedrooms['2ndFlrSF'] == 0].copy()
df_with_2nd_floor = df_with_bedrooms[df_with_bedrooms['2ndFlrSF'] > 0].copy()


# Function to apply KMeans and calculate the optimal number of clusters using Gap Statistic
def apply_kmeans_and_plot(data, title):
    if data.empty:
        print(f"The DataFrame for {title} is empty. No data to process.")
        return

    try:
        # Ensure that the data frame for clustering has no NaN values
        valid_data = data[['GrLivArea', 'BedroomAbvGr']].dropna()
        if valid_data.empty:
            print(f"No valid data available for clustering in {title}.")
            return

        print(f"Processing {title} with {len(valid_data)} entries.")

        optimal_k = OptimalK()
        n_clusters = optimal_k(valid_data.to_numpy(), n_refs=10, cluster_array=np.arange(1, 9))

        # Check gap statistics results
        gap_df = optimal_k.gap_df
        print(f"Optimal number of clusters for {title}: {n_clusters}")
        print("Gap Statistic Results:")
        print(gap_df)

    except Exception as e:
        print(f"An error occurred while processing {title}: {str(e)}")


# Apply to both dataframes
apply_kmeans_and_plot(df_no_2nd_floor, 'Bedrooms vs. Total Living Area (No Second Floor)')
apply_kmeans_and_plot(df_with_2nd_floor, 'Bedrooms vs. Total Living Area (With Second Floor)')

This confirms what the scatter plots suggested: from the results we can see, the optimal number of clusters is just one.
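To make the idea behind the gap statistic more concrete, here is a simplified sketch that compares the log of K-Means inertia on the data with the average over uniform reference samples drawn from the data's bounding box. It is an illustration under stated assumptions, not the implementation used by the `gap_statistic` package; the function name `gap_values` and the synthetic single-blob data are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans


def gap_values(X, k_values, n_refs=5, seed=0):
    """Return Gap(k) = mean(log W_k of references) - log W_k of the data."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)

    def log_inertia(data, k):
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(data)
        return np.log(model.inertia_)

    gaps = []
    for k in k_values:
        # Reference samples: uniform noise with no cluster structure
        ref_logs = [log_inertia(rng.uniform(lo, hi, size=X.shape), k)
                    for _ in range(n_refs)]
        gaps.append(np.mean(ref_logs) - log_inertia(X, k))
    return np.array(gaps)


# Synthetic data: one Gaussian blob
X = np.random.default_rng(1).normal(size=(200, 2))
gaps = gap_values(X, k_values=[1, 2, 3, 4])
print(gaps)
```

A larger gap at a given `k` means the data is clustered more tightly than structureless noise would be at that `k`; selection rules for the final `k` vary and are left out of this sketch.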

## Fixing missing data

Knowing that there is just one cluster, we can apply the mean number of bedrooms for each type of building:
1. Houses with NO 2nd floor: mean number of bedrooms is 2.43
2. Houses with a 2nd floor: mean number of bedrooms is 3.03

Based on this information, each building with no bedrooms will receive 2 or 3 bedrooms (the means rounded to the nearest integer), depending on whether it has a 2nd floor

In [None]:
df.loc[(df['2ndFlrSF'] == 0) & (df['BedroomAbvGr'] == 0), 'BedroomAbvGr'] = 2
df.loc[(df['2ndFlrSF'] > 0) & (df['BedroomAbvGr'] == 0), 'BedroomAbvGr'] = 3
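A variation on the step above is to derive the fill values from the data instead of hard-coding them. The sketch below does this on a made-up frame. Unlike the means computed earlier in this notebook, it excludes zero-bedroom rows when computing the group means; that exclusion and the `toy` data are assumptions for illustration.

```python
import pandas as pd

# Toy data: one zero-bedroom house in each group
toy = pd.DataFrame({
    "2ndFlrSF": [0, 0, 0, 0, 600, 600, 600, 600],
    "BedroomAbvGr": [0, 2, 2, 3, 0, 3, 3, 4],
})

has_2nd_floor = toy["2ndFlrSF"] > 0

# Treat zeros as unknown so they do not pull the group means down
known = toy["BedroomAbvGr"].where(toy["BedroomAbvGr"] > 0)

# Rounded mean of known values within each group, aligned to every row
fill_values = known.groupby(has_2nd_floor).transform("mean").round()

toy["BedroomAbvGr"] = known.fillna(fill_values).astype(int)
print(toy["BedroomAbvGr"].tolist())
```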

## Removing any extra columns we have created in dataframe

In [None]:
# Importing original dataset
df_original = pd.read_csv('outputs/datasets/collection/HousePricesRecords.csv')

# Identify features that are in current and original datasets
matching_features = df.columns.intersection(df_original.columns)

# Applying just existing features, remaining will be discarded
df = df[matching_features]

df.head()
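If the only extra column is the `Has_2nd_Floor` helper created for plotting, a lighter alternative is to drop it by name, which avoids re-reading the original CSV. This is a sketch on a made-up frame; the `helper_columns` list is an assumption and would need to match whatever helper columns were actually added.

```python
import pandas as pd

# Toy frame with one helper column
toy = pd.DataFrame({"GrLivArea": [1200], "BedroomAbvGr": [3], "Has_2nd_Floor": [0]})

# Columns created only for exploration
helper_columns = ["Has_2nd_Floor"]

# errors="ignore" makes the cell safe to re-run after the column is gone
toy = toy.drop(columns=helper_columns, errors="ignore")
print(list(toy.columns))
```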

## Saving given dataframe

We will save the dataframe at this point as inputs/datasets/cleaning/bedrooms.parquet.gzip

In [None]:
df.to_parquet('inputs/datasets/cleaning/bedrooms.parquet.gzip', compression='gzip')

### Adding code to cleaning pipeline

```python
# Fill missing values, convert types, and update values based on conditions
df['BedroomAbvGr'] = df['BedroomAbvGr'].fillna(0).astype(int)
df.loc[df['2ndFlrSF'] == 0, 'BedroomAbvGr'] = df['BedroomAbvGr'].replace(0, 2)
df.loc[df['2ndFlrSF'] > 0, 'BedroomAbvGr'] = df['BedroomAbvGr'].replace(0, 3)
```
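For use in a pipeline, the same steps could be wrapped in a function that works on a copy and can be checked on a small input. The function name `clean_bedrooms` and the toy data are hypothetical; the fill values 2 and 3 are the ones chosen in this notebook.

```python
import numpy as np
import pandas as pd


def clean_bedrooms(df):
    """Fill missing bedrooms with 0, cast to int, then replace zeros by floor type."""
    out = df.copy()
    out["BedroomAbvGr"] = out["BedroomAbvGr"].fillna(0).astype(int)
    no_bedrooms = out["BedroomAbvGr"] == 0
    out.loc[no_bedrooms & (out["2ndFlrSF"] == 0), "BedroomAbvGr"] = 2
    out.loc[no_bedrooms & (out["2ndFlrSF"] > 0), "BedroomAbvGr"] = 3
    return out


# Quick check on toy data
toy = pd.DataFrame({"2ndFlrSF": [0, 500, 0, 700], "BedroomAbvGr": [np.nan, 0, 3, 4]})
cleaned = clean_bedrooms(toy)
print(cleaned["BedroomAbvGr"].tolist())
```

Working on a copy keeps the input frame unchanged, which makes the step easier to test in isolation.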


## Next step is cleaning all basement features