# Notebook 01 - Building 1st and 2nd floor data cleaning and fixing

## Objectives
* Clean data
* Evaluate and process missing data
* Fix potential issues with data in features (1stFlrSF and 2ndFlrSF)

## Inputs
* outputs/datasets/collection/HousePricesRecords.csv

## Outputs
* Cleaned and fixed (missing and potentially wrong) data in the 1stFlrSF and 2ndFlrSF columns
* After cleaning is completed, the current dataset is saved to inputs/datasets/cleaning/floors.parquet.gzip

## Change working directory
In this section we get the location of the current directory and move one step up to the parent folder, so that the app accesses the project folder.

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [None]:
import os

current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("you have set a new current directory")

Confirm new current directory

In [None]:
current_dir = os.getcwd()
current_dir

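Check the current working directory once more

In [None]:
current_dir

If the output still ends in **jupyter_notebooks** (the notebook lives in that subfolder), we go one step up to the parent directory, which is the project's main directory. The guard below prevents moving up a second time when the earlier cell has already changed the directory.
Print out to confirm the working directory

In [None]:
# Move up only if we are still inside the notebooks subfolder
if os.path.basename(current_dir) == "jupyter_notebooks":
    os.chdir(os.path.dirname(current_dir))
current_dir = os.getcwd()
current_dir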

## Loading Dataset

In [None]:
import pandas as pd

df = pd.read_csv("outputs/datasets/collection/HousePricesRecords.csv")
df.head()

## Exploring Data

We will check the two floor-area features for missing data

### Checking if there are any missing values in given features (1st and 2nd floor areas)

First we will check features for missing values

In [None]:
# List of column names to check for missing values
features_to_check = ['1stFlrSF', '2ndFlrSF']

# Loop through each column in the list
for column in features_to_check:
    # Check for missing values
    if df[column].isna().sum() > 0:
        print(f"There are missing values in '{column}'.")
        # Fill missing values with a default value - 0
        df[column] = df[column].fillna(0)
    else:
        print(f"No missing values in '{column}'.")
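The same check can also be written without a loop. A minimal sketch on a small made-up frame: the `toy` data below is illustrative only and is not taken from HousePricesRecords.csv.

```python
import numpy as np
import pandas as pd

# Illustrative data only; the values are made up
toy = pd.DataFrame({
    "1stFlrSF": [856, 1262, 920],
    "2ndFlrSF": [854.0, np.nan, 866.0],
})
features_to_check = ["1stFlrSF", "2ndFlrSF"]

# Count missing values per column in a single call
missing_counts = toy[features_to_check].isna().sum()
print(missing_counts)

# Fill missing values with 0, as the loop above does
toy[features_to_check] = toy[features_to_check].fillna(0)
```

Printing the counts first keeps a record of how many values were filled, which the loop version only reports as present or absent.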


### Checking Data Type

In [None]:
df[['1stFlrSF', '2ndFlrSF']].dtypes

2ndFlrSF is float, so we convert it to integer

In [None]:
df['2ndFlrSF'] = df['2ndFlrSF'].astype(int)
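The order of the steps matters: `astype(int)` cannot convert NaN, so missing values have to be filled before the conversion. A small sketch with made-up values (the series `s` is illustrative, not real data):

```python
import numpy as np
import pandas as pd

# Illustrative series with one missing value
s = pd.Series([854.0, np.nan, 866.0], name="2ndFlrSF")

# Converting directly fails because NaN has no integer representation
try:
    s.astype(int)
    raised = False
except (ValueError, TypeError):
    raised = True
print("Direct conversion raised an error:", raised)

# Filling first makes the conversion safe
converted = s.fillna(0).astype(int)
print(converted.tolist())
```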

### Checking for values lower than zero

In [None]:
# Loop through each column in the list
for column in features_to_check:
    # Check if there are any negative values in the column
    if (df[column] < 0).any():
        print(f"There are negative values in '{column}', which is not allowed.")
    else:
        print(f"No negative values in '{column}'.")


### Checking for wrong data values, where 1st or 2nd floor area is bigger than the above-ground living area (GrLivArea)

In [None]:
for floor_col in features_to_check:
    # Find entries where floor area exceeds 'GrLivArea'
    invalid_areas = df[df[floor_col] > df['GrLivArea']]
    if not invalid_areas.empty:
        print(f"There are entries where '{floor_col}' is greater than 'GrLivArea'.")
        # Display the problematic entries
        print(invalid_areas[[floor_col, 'GrLivArea']])
    else:
        print(f"All '{floor_col}' values are within the valid range of 'GrLivArea'.")


We also need to inspect whether any 2nd floor is bigger than the 1st

In [None]:
invalid_areas = df[df['1stFlrSF'] < df['2ndFlrSF']]
if not invalid_areas.empty:
    print(" There are records where 2nd floor is bigger than 1st floor, total number of records: ",
          invalid_areas.shape[0])
else:
    print("All values are correct")

We have found 129 records where the 2nd floor is bigger than the 1st.

Such data is very unlikely.
Our steps:
1. Create an extra column in the dataset to flag which records are wrong
2. Filter the dataset for records where the 2nd floor is bigger than the 1st floor
3. Create a copy of the filtered dataset
4. Calculate by what percentage the 2nd floor exceeds the 1st floor

In [None]:
# Creating extra column in dataset to store where 2nd floor is bigger than 1st
df['2nd_floor_larger'] = df['1stFlrSF'] < df['2ndFlrSF']

# Filtering dataset for wrong records and making copy of such dataset
bad_records = df[df['2nd_floor_larger']].copy()

# Calculating ratios
bad_records['floor_ratio'] = ((bad_records['2ndFlrSF'] - bad_records['1stFlrSF']) / bad_records['1stFlrSF']) * 100

bad_records[['1stFlrSF', '2ndFlrSF', 'floor_ratio']]

We can see that the differences are quite high. Let's check the average, just out of curiosity

In [None]:
bad_records['floor_ratio'].mean()

The average difference is high. Such abnormalities could have several causes:
* mistyping
* entering values in the wrong cells: 1st and 2nd floor areas were swapped when entering data
* the 2nd floor really is bigger, but this is very unlikely and uncommon, so we reject this explanation

Let's check whether there are any records where the 2nd floor is greater than GrLivArea

In [None]:
test = bad_records[bad_records['2ndFlrSF'] > bad_records['GrLivArea']]
test

The test returns no records, so no 2nd floor value exceeds GrLivArea. We therefore treat these records as cases where the 1st and 2nd floor values were entered in each other's columns.

We will swap those values back

In [None]:
indexes = df['2ndFlrSF'] > df['1stFlrSF']
df.loc[indexes, ['1stFlrSF', '2ndFlrSF']] = df.loc[indexes, ['2ndFlrSF', '1stFlrSF']].values
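The `.values` at the end of the assignment is what makes the swap work. When the right-hand side is a DataFrame, pandas aligns it on column labels before assigning, so each column receives its own values again and nothing changes. A minimal sketch on made-up data (the `toy` frame is illustrative only):

```python
import pandas as pd

# Illustrative data: row 0 has a larger 2nd floor, row 1 is fine
toy = pd.DataFrame({"1stFlrSF": [500, 900], "2ndFlrSF": [800, 400]})
mask = toy["2ndFlrSF"] > toy["1stFlrSF"]
cols = ["1stFlrSF", "2ndFlrSF"]

# Without .values: the right-hand side is aligned on column labels,
# so the frame stays as it was
no_swap = toy.copy()
no_swap.loc[mask, cols] = no_swap.loc[mask, cols[::-1]]

# With .values: a plain NumPy array is assigned by position,
# so the two columns are exchanged in the masked rows
swapped = toy.copy()
swapped.loc[mask, cols] = swapped.loc[mask, cols[::-1]].values

print(no_swap)
print(swapped)
```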

Let's check again whether there are any records where the 2nd floor is bigger than the 1st, to confirm that everything is fixed

In [None]:
# Creating extra column in dataset to store where 2nd floor is bigger than 1st
df['2nd_floor_larger'] = df['1stFlrSF'] < df['2ndFlrSF']

# Filtering dataset for wrong records and making copy of such dataset
bad_records = df[df['2nd_floor_larger']].copy()

# Calculating ratios
bad_records['floor_ratio'] = ((bad_records['2ndFlrSF'] - bad_records['1stFlrSF']) / bad_records['1stFlrSF']) * 100

bad_records[['1stFlrSF', '2ndFlrSF', 'floor_ratio']]

We can see all records are fixed now.

As we have created an extra column in the dataset, we will remove it before exporting the dataset as parquet, so the number of features remains the same

In [None]:
# Importing original dataset
df_original = pd.read_csv('outputs/datasets/collection/HousePricesRecords.csv')

# Identify features that are in current and original datasets
matching_features = df.columns.intersection(df_original.columns)

# Applying just existing features, remaining will be discarded
df = df[matching_features]

df.head()

## Exporting current dataset as parquet

In [None]:
df.to_parquet('inputs/datasets/cleaning/floors.parquet.gzip', compression='gzip')

### Adding code to cleaning Pipeline:

```python
# Fill missing values and convert data types
df[['1stFlrSF', '2ndFlrSF']] = df[['1stFlrSF', '2ndFlrSF']].fillna(0).astype(int)

# Swap values where '2ndFlrSF' is greater than '1stFlrSF'
swap_idx = df['2ndFlrSF'] > df['1stFlrSF']
df.loc[swap_idx, ['1stFlrSF', '2ndFlrSF']] = df.loc[swap_idx, ['2ndFlrSF', '1stFlrSF']].values
```
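For reuse in a pipeline, the two steps could be wrapped in a function. The sketch below is one possible shape, not the project's actual pipeline code: the function name `clean_floor_areas` and the `toy` data are hypothetical.

```python
import numpy as np
import pandas as pd


def clean_floor_areas(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: fill, convert and swap the floor-area columns."""
    df = df.copy()  # leave the caller's frame untouched
    cols = ["1stFlrSF", "2ndFlrSF"]

    # Fill missing values and convert data types
    df[cols] = df[cols].fillna(0).astype(int)

    # Swap values where '2ndFlrSF' is greater than '1stFlrSF'
    swap_idx = df["2ndFlrSF"] > df["1stFlrSF"]
    df.loc[swap_idx, cols] = df.loc[swap_idx, cols[::-1]].values
    return df


# Illustrative data: row 1 needs a swap, row 2 has a missing value
toy = pd.DataFrame({
    "1stFlrSF": [856, 500, 920],
    "2ndFlrSF": [854.0, 800.0, np.nan],
})
cleaned = clean_floor_areas(toy)
print(cleaned)
```

Working on a copy keeps the function free of side effects, so it can be applied to the raw dataset repeatedly while testing.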

## Next step is cleaning and fixing Bedrooms