# Notebook 07 - Masonry veneer area and porch areas data cleaning and fixing

## Objectives
* Clean data
* Evaluate and process missing data
* Fix potential issues with data in features (MasVnrArea, EnclosedPorch, OpenPorchSF)

## Inputs
* inputs/datasets/cleaning/lot_features.parquet.gzip

## Outputs
* Clean and fix (missing and potentially wrong) data in the given columns
* After cleaning is completed, we will save the current dataset in inputs/datasets/cleaning/masonry_and_porch.parquet.gzip

## Change working directory
In this section we get the location of the current directory and move one step up, to the parent folder, so that the app can access the project folder.

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [None]:
import os

current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("you have set a new current directory")
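As a small illustration of the parent-directory step above, the sketch below shows what `os.path.dirname()` returns for a path. The folder name `my_project` is a hypothetical example and not the real project path; no directory is actually changed here.

```python
import os

# Hypothetical path, used only for illustration
notebook_dir = os.path.join("my_project", "jupyter_notebooks")

# os.path.dirname() drops the last path component, giving the parent folder
project_dir = os.path.dirname(notebook_dir)
print(project_dir)
```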

Confirm new current directory

In [None]:
current_dir = os.getcwd()
current_dir

We need to check current working directory

In [None]:
current_dir

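If the output still shows **jupyter_notebooks**, we are in the notebook's subfolder and need to go one step up to the parent directory, which is our main project directory. The check below makes the step safe to re-run: it moves up only when we are still inside **jupyter_notebooks**, so we never end up above the project folder.
Print out to confirm the working directory

In [None]:
# Move up only if we are still inside the notebook subfolder
if os.path.basename(current_dir) == "jupyter_notebooks":
    os.chdir(os.path.dirname(current_dir))
current_dir = os.getcwd()
current_dir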

## Loading Dataset

In [None]:
import pandas as pd

df = pd.read_parquet("inputs/datasets/cleaning/lot_features.parquet.gzip")
df.head()

## Exploring Data

We will count the missing values in each of the three features

In [None]:
print("Masonry veneer missing values are: ", df['MasVnrArea'].isnull().sum())
print("Enclosed Porch missing values are: ", df['EnclosedPorch'].isnull().sum())
print("Open Porch missing values are: ", df['OpenPorchSF'].isnull().sum())
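The same counts can also be obtained in a single call by selecting the three columns first. This is a minimal sketch on a small hypothetical DataFrame named `toy` (the values are made up and are not taken from the dataset), so it runs without the parquet file.

```python
import numpy as np
import pandas as pd

# Made-up rows, only to illustrate the call
toy = pd.DataFrame({
    "MasVnrArea": [196.0, np.nan, 0.0],
    "EnclosedPorch": [np.nan, np.nan, 272.0],
    "OpenPorchSF": [61, 0, 35],
})

columns = ["MasVnrArea", "EnclosedPorch", "OpenPorchSF"]

# isnull() marks missing cells, sum() counts them per column
missing_counts = toy[columns].isnull().sum()
print(missing_counts)
```

On the real data, `toy` would simply be replaced by `df`.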

The counts for our dataset show missing values in Masonry Veneer and Enclosed Porch.

We will replace the missing Enclosed Porch values with 0: the majority of its values are missing, so we assume those houses have no enclosed porch.

Masonry Veneer is missing just 8 values. Out of curiosity, we will check its mean and how many houses have masonry veneer at all.

In [None]:
df.loc[:, 'EnclosedPorch'] = df['EnclosedPorch'].fillna(value=0)
print("Masonry veneer mean number is: ", df['MasVnrArea'].mean())
print("Total amount of houses which have Masonry Veneer is: ", (df['MasVnrArea'] > 0).sum())

We can see that only about a third of the houses have masonry veneer, and the mean area is about 103 square feet.
We will replace the missing values with 0.

In [None]:
df.loc[:, 'MasVnrArea'] = df['MasVnrArea'].fillna(value=0)
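One side effect of filling with 0 is worth keeping in mind: `mean()` skips missing values, so the mean drops once they are replaced by zeros. The sketch below shows this on a hypothetical Series with made-up values, not on the dataset itself.

```python
import numpy as np
import pandas as pd

# Made-up areas with one missing value
areas = pd.Series([196.0, np.nan, 0.0, 162.0], name="MasVnrArea")

# mean() ignores NaN by default, so this averages three values
mean_before = areas.mean()

filled = areas.fillna(0)

# After filling, the former NaN counts as a 0 and all four values are averaged
mean_after = filled.mean()
missing_after = int(filled.isnull().sum())

print(mean_before, mean_after, missing_after)
```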

We will check data types for our current features

In [None]:
df[['MasVnrArea', 'EnclosedPorch', 'OpenPorchSF']].dtypes

We need to convert MasVnrArea and EnclosedPorch to int

In [None]:
df[['MasVnrArea', 'EnclosedPorch']] = df[['MasVnrArea', 'EnclosedPorch']].astype(int)
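The order of the steps matters here: a float column that still contains missing values cannot be cast to a plain `int`, which is why the missing values are filled first. A minimal sketch on a hypothetical Series (made-up values) illustrates this.

```python
import numpy as np
import pandas as pd

# Made-up areas with one missing value
areas = pd.Series([196.0, np.nan, 162.0], name="MasVnrArea")

# Casting while NaN is still present raises an error
try:
    areas.astype(int)
    cast_failed = False
except ValueError:
    cast_failed = True

# Filling first makes the cast succeed
converted = areas.fillna(0).astype(int)
print(cast_failed, converted.tolist())
```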

## Saving current dataset

We will save current dataset as inputs/datasets/cleaning/masonry_and_porch.parquet.gzip

In [None]:
df.to_parquet('inputs/datasets/cleaning/masonry_and_porch.parquet.gzip', compression='gzip')

### Adding cleaning code to Pipeline
```python
# Fill missing values and immediately convert to integers for specified columns
df['EnclosedPorch'] = df['EnclosedPorch'].fillna(0).astype(int)
df['MasVnrArea'] = df['MasVnrArea'].fillna(0).astype(int)

```
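If it helps to reuse this step, the two lines above could be wrapped in a small function. This is only a sketch: the function name `clean_masonry_and_porch` is hypothetical, and the input rows are made up for illustration. It works on a copy so the original DataFrame is left untouched.

```python
import numpy as np
import pandas as pd


def clean_masonry_and_porch(df):
    """Hypothetical helper: fill missing areas with 0 and cast them to int."""
    df = df.copy()  # do not modify the caller's DataFrame
    for column in ["EnclosedPorch", "MasVnrArea"]:
        df[column] = df[column].fillna(0).astype(int)
    return df


# Made-up rows, only to show the function in use
raw = pd.DataFrame({
    "EnclosedPorch": [np.nan, 272.0],
    "MasVnrArea": [196.0, np.nan],
})
cleaned = clean_masonry_and_porch(raw)
print(cleaned)
```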

## Next step is cleaning Overall Quality and Condition of house-cleaning and fixing data in garages