# Notebook 06 - Lot Area and Frontage data cleaning and fixing

## Objectives
* Clean data
* Evaluate and process missing data
* Fix potential issues with data in the features LotArea and LotFrontage

## Inputs
* inputs/datasets/cleaning/kitchen.parquet.gzip

## Outputs
* Cleaned and fixed (missing and potentially wrong) data in the given columns
* After cleaning is completed, we will save the current dataset in inputs/datasets/cleaning/lot_features.parquet.gzip

## Change working directory
In this section we will get the location of the current directory and move up to its parent folder, so that the app accesses the project folder.

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [1]:
import os

current_dir = os.getcwd()
current_dir

'/Users/pecukevicius/DataspellProjects/heritage_houses_p5/jupyter_notebooks/data_cleaning'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("you have set a new current directory")

you have set a new current directory


Confirm new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/Users/pecukevicius/DataspellProjects/heritage_houses_p5/jupyter_notebooks'

We need to check current working directory

In [4]:
current_dir

'/Users/pecukevicius/DataspellProjects/heritage_houses_p5/jupyter_notebooks'

We can see that the current directory is **jupyter_notebooks**, because this notebook sits in a subfolder of it. We will go one more step up to the parent directory, which is the main project directory.
Print it out to confirm the working directory

In [5]:
os.chdir(os.path.dirname(current_dir))
current_dir = os.getcwd()
current_dir

'/Users/pecukevicius/DataspellProjects/heritage_houses_p5'
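The two `os.chdir` calls above only work if the notebook sits exactly two levels below the project root. A minimal sketch of an alternative follows, assuming the project root can be recognised by the presence of an `inputs` folder. The helper name `find_project_root` and the choice of marker are hypothetical and not part of this notebook.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "inputs") -> Path:
    """Walk upwards from `start` until a directory containing `marker` is found."""
    start = start.resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains a '{marker}' directory")


# Usage in a notebook (commented out so the sketch has no side effects):
# import os
# os.chdir(find_project_root(Path.cwd()))
```

With this approach the notebook could be moved to a different subfolder depth without editing the directory cells.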

## Loading Dataset

In [6]:
import pandas as pd

df = pd.read_parquet("inputs/datasets/cleaning/kitchen.parquet.gzip")
df.head()

Unnamed: 0.1,Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,EnclosedPorch,GarageArea,...,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd,SalePrice
0,0,856,854,3,No,706,GLQ,150,0.0,548,...,65.0,196.0,61,5,7,856,0.0,2003,2003,208500
1,1,1262,0,3,Gd,978,ALQ,284,,460,...,80.0,0.0,0,8,6,1262,,1976,1976,181500
2,2,920,866,3,Mn,486,GLQ,434,0.0,608,...,68.0,162.0,42,5,7,920,,2001,2002,223500
3,3,961,0,2,No,216,ALQ,540,,642,...,60.0,0.0,35,5,7,756,,1915,1970,140000
4,4,1145,0,4,Av,655,GLQ,490,0.0,836,...,84.0,350.0,84,5,8,1145,,2000,2000,250000


## Exploring Data

We will count the missing values in the two features of interest, LotArea and LotFrontage

In [7]:
df['LotArea'].isnull().sum()

0

In [8]:
df['LotFrontage'].isna().sum()

259

We can see that there are no missing values in LotArea, but LotFrontage has 259.
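To get every feature with missing data as a list in one step, rather than checking columns one by one, something like the following could be used. It is a sketch on a small toy frame with illustrative values only; on the real dataset `toy` would be replaced by `df`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data (values are illustrative only)
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550],
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "WoodDeckSF": [0.0, np.nan, np.nan, np.nan],
})

# Count missing values per column and keep only columns with at least one
missing_counts = toy.isna().sum()
missing_counts = missing_counts[missing_counts > 0].sort_values(ascending=False)

# Column names as a plain Python list, most incomplete column first
columns_with_missing = missing_counts.index.tolist()
print(columns_with_missing)
```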

As LotFrontage is the linear feet of street connected to the property, it is hard to decide how it could be correlated with other features: it depends on the designer rather than on the house owner.

We will replace all missing values with the mean, rounded to a whole number

In [9]:
df['LotFrontage'].mean()

70.04995836802665

In [10]:
df.loc[:, 'LotFrontage'] = df['LotFrontage'].fillna(70)
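The literal `70` above is the column mean rounded to a whole number. A minimal sketch of deriving the fill value from the data instead of typing it in, shown on toy values; the extra flag column `LotFrontageWasMissing` is a hypothetical addition and not part of this notebook.

```python
import numpy as np
import pandas as pd

# Toy column standing in for LotFrontage (values are illustrative only)
toy = pd.DataFrame({"LotFrontage": [60.0, np.nan, 80.0, 70.0, np.nan]})

# Mean of the observed values (NaN is skipped by default), rounded to a whole number
fill_value = round(toy["LotFrontage"].mean())

# Optional: remember which rows were imputed before overwriting them
toy["LotFrontageWasMissing"] = toy["LotFrontage"].isna()
toy["LotFrontage"] = toy["LotFrontage"].fillna(fill_value)
```

Keeping the flag is a design choice: it lets a later analysis distinguish measured frontage from imputed frontage, at the cost of one extra column.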

We will check the data types of LotFrontage and LotArea

In [16]:
# Print the data type of 'LotFrontage'
print("Data type of 'LotFrontage':", df['LotFrontage'].dtype)

# Print the data type of 'LotArea'
print("Data type of 'LotArea':", df['LotArea'].dtype)

Data type of 'LotFrontage': int64
Data type of 'LotArea': int64


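LotFrontage needs converting to integer. After filling the missing values the column is still stored as float (its values display as 65.0, 80.0 in the table above). The int64 shown for LotFrontage in the output above comes from re-running that cell after the conversion below, as its execution count (16) indicates.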

In [12]:
df['LotFrontage'] = df['LotFrontage'].round().astype(int)
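The effect of each step on the data type can be checked on a small example. The sketch below uses toy values and records the dtype before filling, after filling, and after casting.

```python
import numpy as np
import pandas as pd

# Toy column standing in for LotFrontage (values are illustrative only)
toy = pd.DataFrame({"LotFrontage": [65.0, np.nan, 68.0]})

# A numeric column that contains NaN is stored as float64
dtype_before = toy["LotFrontage"].dtype

# Filling the gaps removes the NaN values but does not change the dtype
toy["LotFrontage"] = toy["LotFrontage"].fillna(70)
dtype_after_fill = toy["LotFrontage"].dtype

# round() guards against fractional values; astype(int) then casts to an integer dtype
toy["LotFrontage"] = toy["LotFrontage"].round().astype(int)
dtype_after_cast = toy["LotFrontage"].dtype

print(dtype_before, dtype_after_fill, dtype_after_cast)
```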

Now all we can do is check if LotArea values are correct:

Lot size generally should be bigger than:
* WoodDeckSF - Wood deck area in square feet
* GarageArea - Size of garage in square feet
* GrLivArea - above ground living area in square feet
* EnclosedPorch - Enclosed porch area in square feet
* OpenPorchSF - Open porch area in square feet

So we have to sum the values of all those features, and the total has to be smaller than the lot area.
You cannot build a house with these features on a plot that is too small.

In [13]:
# Summing relevant area features
df['SummedAreas'] = df['WoodDeckSF'] + df['GarageArea'] + df['GrLivArea'] + \
                    df['EnclosedPorch'] + df['OpenPorchSF']

# Creating a boolean mask where summed areas exceed LotArea
df['IsAreaExceeded'] = df['SummedAreas'] > df['LotArea']

# Filtering rows where the built-up area exceeds the lot area
inconsistent_records = df[df['IsAreaExceeded']]

# Display these records
inconsistent_records

Unnamed: 0.1,Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,EnclosedPorch,GarageArea,...,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd,SalePrice,SummedAreas,IsAreaExceeded


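We can see that no record is flagged, so no summed area exceeds LotArea. Note that rows with a missing WoodDeckSF or EnclosedPorch produce a NaN sum, which compares as False, so this check only covers rows where all five area features are present.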
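A sketch of a NaN-tolerant variant of this check, on toy values only. It compares chained `+`, which propagates NaN, with `DataFrame.sum(axis=1)`, which skips NaN by default and so treats a missing area as 0. Whether a missing deck or porch area should count as 0 is an assumption that would need confirming for this dataset.

```python
import numpy as np
import pandas as pd

# Toy frame; row 1 has a lot that is too small but also has missing areas
toy = pd.DataFrame({
    "LotArea": [8450, 1000, 9600],
    "WoodDeckSF": [0.0, np.nan, np.nan],
    "GarageArea": [548, 400, 460],
    "GrLivArea": [1710, 900, 1262],
    "EnclosedPorch": [0.0, np.nan, np.nan],
    "OpenPorchSF": [61, 0, 0],
})
area_columns = ["WoodDeckSF", "GarageArea", "GrLivArea", "EnclosedPorch", "OpenPorchSF"]

# Chained `+` yields NaN for incomplete rows, and NaN > LotArea is False
plus_sum = (toy["WoodDeckSF"] + toy["GarageArea"] + toy["GrLivArea"]
            + toy["EnclosedPorch"] + toy["OpenPorchSF"])
flagged_with_plus = toy[plus_sum > toy["LotArea"]]

# sum(axis=1) skips NaN by default, so incomplete rows are still compared
nan_safe_sum = toy[area_columns].sum(axis=1)
flagged_nan_safe = toy[nan_safe_sum > toy["LotArea"]]

print(len(flagged_with_plus), flagged_nan_safe.index.tolist())
```

In this toy frame only the NaN-tolerant version flags row 1, whose garage and living area alone (1300) exceed its lot area (1000).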

## Removing added columns

We will use the same code as in the previous cleaning notebook 04_basement.ipynb

In [14]:
# Removing extra columns that do not belong to the original dataset, as we have created them

df_original_features = pd.read_csv("outputs/datasets/collection/HousePricesRecords.csv")

# Identify columns in df that are also in df_original_features
common_columns = df.columns.intersection(df_original_features.columns)

# Filter df to only include those common columns
df = df[common_columns]

df

Unnamed: 0.1,Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,EnclosedPorch,GarageArea,...,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd,SalePrice
0,0,856,854,3,No,706,GLQ,150,0.0,548,...,65,196.0,61,5,7,856,0.0,2003,2003,208500
1,1,1262,0,3,Gd,978,ALQ,284,,460,...,80,0.0,0,8,6,1262,,1976,1976,181500
2,2,920,866,3,Mn,486,GLQ,434,0.0,608,...,68,162.0,42,5,7,920,,2001,2002,223500
3,3,961,0,2,No,216,ALQ,540,,642,...,60,0.0,35,5,7,756,,1915,1970,140000
4,4,1145,0,4,Av,655,GLQ,490,0.0,836,...,84,350.0,84,5,8,1145,,2000,2000,250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,1455,953,694,3,No,0,Unf,953,,460,...,62,0.0,40,5,6,953,0.0,1999,2000,175000
1456,1456,2073,0,2,No,1028,ALQ,514,,500,...,85,119.0,0,6,6,1542,,1978,1988,210000
1457,1457,1188,1152,4,No,275,GLQ,877,,252,...,66,0.0,60,9,7,1152,,1941,2006,266500
1458,1458,1078,0,2,Mn,49,Unf,1029,112.0,240,...,68,0.0,0,6,5,1078,,1950,1996,142125
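When the names of the helper columns are known, as they are here, they could also be dropped by name without re-reading the original CSV. This is a sketch on toy data of an alternative, not the approach used in this notebook.

```python
import pandas as pd

# Toy frame with the two helper columns created during the area check
toy = pd.DataFrame({
    "LotArea": [8450, 9600],
    "LotFrontage": [65, 80],
    "SummedAreas": [2319, 1722],
    "IsAreaExceeded": [False, False],
})

# errors="ignore" makes the cell safe to re-run after the columns are gone
helper_columns = ["SummedAreas", "IsAreaExceeded"]
toy = toy.drop(columns=helper_columns, errors="ignore")
print(toy.columns.tolist())
```

The intersection approach used above has the advantage of also removing any helper column that was forgotten, while dropping by name avoids the extra file read.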


## Saving current dataset

We will save the current dataset as inputs/datasets/cleaning/lot_features.parquet.gzip

In [15]:
df.to_parquet('inputs/datasets/cleaning/lot_features.parquet.gzip', compression='gzip')

### Adding cleaning code to pipeline

```python
df.loc[:, 'LotFrontage'] = df['LotFrontage'].fillna(70)
df['LotFrontage'] = df['LotFrontage'].round().astype(int)
```
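For reuse in a pipeline, the two lines above could be wrapped in a function. The following is a minimal sketch; the function name `clean_lot_frontage` and its `fill_value` parameter are hypothetical, and the default of 70 is the rounded mean found earlier in this notebook.

```python
import numpy as np
import pandas as pd


def clean_lot_frontage(df: pd.DataFrame, fill_value: int = 70) -> pd.DataFrame:
    """Return a copy of `df` with LotFrontage imputed and cast to integer."""
    cleaned = df.copy()
    cleaned["LotFrontage"] = (
        cleaned["LotFrontage"].fillna(fill_value).round().astype(int)
    )
    return cleaned


# Toy input (values are illustrative only)
toy = pd.DataFrame({"LotFrontage": [65.0, np.nan, 68.4]})
result = clean_lot_frontage(toy)
print(result["LotFrontage"].tolist())
```

Working on a copy leaves the input frame untouched, which makes the step easier to test in isolation.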

## Next step is cleaning Masonry Veneer and Porch features - cleaning and fixing data in garages