# Notebook 08 - Overall Quality and Overall Condition data cleaning and fixing

## Objectives
* Clean data
* Evaluate and process missing data
* Fix potential issues with data in the features OverallCond and OverallQual

## Inputs
* inputs/datasets/cleaning/masonry_and_porch.parquet.gzip

## Outputs
* Cleaned and fixed (missing and potentially wrong) data in the given columns
* After cleaning is complete, the current dataset is saved to inputs/datasets/cleaning/clean_finished.parquet.gzip

## Change working directory
In this section we get the location of the current directory and move one level up, to the parent folder, so that the app accesses the project folder.

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [None]:
import os

current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("you have set a new current directory")

Confirm the new current directory. It should now be the project's main directory rather than **jupyter_notebooks**, the subfolder that contains this notebook. We do not call os.chdir() a second time here, because that would move one more level up, out of the project folder.

In [None]:
current_dir = os.getcwd()
current_dir
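
Because re-running the directory change moves one more level up each time, a guard can make this step safe to repeat. The following is a minimal sketch, not part of the original notebook: the helper name `project_root_from` is hypothetical, and it assumes the notebooks folder is called `jupyter_notebooks`, as noted above.

```python
import os


def project_root_from(cwd, notebooks_dir="jupyter_notebooks"):
    """Return the parent of cwd only when cwd is the notebooks folder.

    `project_root_from` and `notebooks_dir` are illustrative names,
    not part of the original notebook.
    """
    cwd = os.path.normpath(cwd)
    if os.path.basename(cwd) == notebooks_dir:
        return os.path.dirname(cwd)
    # Already outside the notebooks folder: leave the path unchanged.
    return cwd


# Safe to run more than once: a second call leaves the directory as it is.
os.chdir(project_root_from(os.getcwd()))
```

With this guard, running the cell again after the directory has already changed has no further effect.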

## Loading Dataset

In [None]:
import pandas as pd

df = pd.read_parquet("inputs/datasets/cleaning/masonry_and_porch.parquet.gzip")
df.head()

## Exploring Data

We will count the missing values in OverallCond and OverallQual

In [None]:
print("Overall Condition of house records missing: ", df['OverallCond'].isnull().sum())
print("Overall Quality of house records missing: ", df['OverallQual'].isnull().sum())

We can see that there is no missing data.

Now we will check the data types of these features

In [None]:
df[['OverallCond', 'OverallQual']].dtypes

Both features have no missing data and are integers.
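
The objectives also mention potentially wrong data, which the missing-value and data type checks do not cover. One way to look for it is a range check. The sketch below assumes both ratings use a 1 to 10 scale; that scale, the helper name `out_of_range_rows` and the sample values are assumptions for illustration, not taken from this notebook.

```python
import pandas as pd


def out_of_range_rows(df, columns, low=1, high=10):
    """Return the rows where any of the given columns falls outside [low, high].

    The 1-10 default is an assumption about the rating scale.
    Missing values are also flagged, because between() is False for NaN.
    """
    outside = ~df[columns].apply(lambda col: col.between(low, high))
    return df[outside.any(axis=1)]


# Hypothetical sample; the third row has an invalid OverallQual of 11.
sample = pd.DataFrame({"OverallCond": [5, 7, 3], "OverallQual": [6, 8, 11]})
bad_rows = out_of_range_rows(sample, ["OverallCond", "OverallQual"])
print(bad_rows)
```

On the real dataset the call would be `out_of_range_rows(df, ['OverallCond', 'OverallQual'])`; an empty result would mean every rating is inside the assumed scale.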

Rather than create one more notebook for the last feature, wood deck area (WoodDeckSF), we include it here.
Let's check how many records are missing this value

In [None]:
df['WoodDeckSF'].isnull().sum()

In total, 1305 records are missing a WoodDeckSF value. We assume that a missing value means the house has no wood deck, so we will replace missing values with 0

In [None]:
df.loc[:, 'WoodDeckSF'] = df['WoodDeckSF'].fillna(value=0)

Let's check the data type of this feature

In [None]:
df['WoodDeckSF'].dtypes

It is float, so we will change it to integer

In [None]:
df['WoodDeckSF'] = df['WoodDeckSF'].astype(int)
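
Casting float to integer silently drops any fractional part, so it can be worth confirming first that every value is a whole number. This is a minimal sketch under that assumption; the helper name `fill_and_cast_to_int` and the sample values are hypothetical.

```python
import pandas as pd


def fill_and_cast_to_int(series, fill_value=0):
    """Fill missing values, then cast to int only if nothing would be truncated."""
    filled = series.fillna(fill_value)
    fractional = filled[filled % 1 != 0]
    if not fractional.empty:
        raise ValueError(f"{len(fractional)} value(s) have a fractional part")
    return filled.astype(int)


# Hypothetical sample standing in for df['WoodDeckSF'].
sample = pd.Series([120.0, None, 0.0, 36.0], name="WoodDeckSF")
converted = fill_and_cast_to_int(sample)
print(converted)
```

If the check raises, the values can be inspected before deciding whether to round or keep them as float.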

We have completed cleaning the whole dataset.

Finally, loud and proud, we will check the last feature here, SalePrice: is any data missing, and what type is it?

In [None]:
print("SalePrice records missing: ", df['SalePrice'].isnull().sum())
print("SalePrice data type: ", df['SalePrice'].dtypes)

Well done! All features are now free of missing values and have the expected data types.
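
To back this up for the whole dataset rather than one column at a time, the missing-value counts and data types can be gathered into a single table. The sketch below is one possible way to do it; the helper name `cleaning_report` and the sample data are hypothetical.

```python
import pandas as pd


def cleaning_report(df):
    """Return one row per column with its missing-value count and dtype."""
    return pd.DataFrame({
        "missing": df.isnull().sum(),
        "dtype": df.dtypes.astype(str),
    })


# Hypothetical sample standing in for the cleaned dataset.
sample = pd.DataFrame({
    "OverallCond": [5, 7],
    "WoodDeckSF": [0, 120],
    "SalePrice": [208500, 181500],
})
report = cleaning_report(sample)
print(report)

# Fail loudly if any column still contains missing values.
assert report["missing"].sum() == 0, "some columns still contain missing values"
```

Calling `cleaning_report(df)` on the real dataset just before exporting gives a quick final check.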

## Exporting dataframe for analysis, modeling, etc.

We will export it to inputs/datasets/cleaning/clean_finished.parquet.gzip

In [None]:
df.to_parquet("inputs/datasets/cleaning/clean_finished.parquet.gzip", compression='gzip')

## Adding Cleaning code to pipeline

```python
# Fill missing WoodDeckSF values with 0, then convert the column to integers
df.loc[:, 'WoodDeckSF'] = df['WoodDeckSF'].fillna(value=0)
df['WoodDeckSF'] = df['WoodDeckSF'].astype(int)
```
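
For use in a pipeline, the two lines above can be wrapped in a function that returns a cleaned copy and leaves its input untouched. This is a sketch only: the function name `clean_wood_deck` and the sample data are hypothetical, and it assumes, as above, that a missing WoodDeckSF value means no deck.

```python
import pandas as pd


def clean_wood_deck(df):
    """Return a copy of df with WoodDeckSF filled with 0 and cast to int."""
    cleaned = df.copy()
    cleaned["WoodDeckSF"] = cleaned["WoodDeckSF"].fillna(0).astype(int)
    return cleaned


# Hypothetical raw data with one missing deck area.
raw = pd.DataFrame({"WoodDeckSF": [None, 192.0], "SalePrice": [208500, 181500]})
cleaned = raw.pipe(clean_wood_deck)
print(cleaned)
```

Working on a copy keeps the step repeatable, and `DataFrame.pipe` lets it be chained with the cleaning steps from the earlier notebooks.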