# Notebook 05 - Kitchen Quality data cleaning and fixing

## Objectives
* Clean data
* Evaluate and process missing data
* Fix potential issues with data in feature (KitchenQual)

## Inputs
* inputs/datasets/cleaning/garages_and_build_years.csv

## Outputs
* Cleaned and fixed (missing and potentially wrong) data in the KitchenQual column
* After cleaning is completed, the current dataset is saved to inputs/datasets/cleaning/kitchen.csv

## Change working directory
In this section we get the location of the current directory and move one step up to the parent folder, so that the app can access the project folder.

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("you have set a new current directory")

Confirm new current directory

In [3]:
current_dir = os.getcwd()
current_dir

Check the current working directory:

In [4]:
current_dir

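If the current directory is still **jupyter_notebooks** (the notebook lives in that subfolder), we move one step up to the parent directory, which is the project's main directory. The check on the folder name keeps a re-run of this cell from climbing a second level.
Print the result to confirm the working directory.

In [5]:
# Move up only while we are still inside the notebooks subfolder
if os.path.basename(current_dir) == "jupyter_notebooks":
    os.chdir(os.path.dirname(current_dir))
current_dir = os.getcwd()
current_dir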
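
The directory change can be tried safely outside the project. The following is a minimal sketch, assuming a hypothetical layout `project/jupyter_notebooks` created inside a temporary folder. The check on the folder name is an illustrative safeguard so that running the step twice does not move up two levels.

```python
import os
import tempfile

start_dir = os.getcwd()  # remember where we started so we can restore it

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical layout for this sketch: <tmp>/project/jupyter_notebooks
    notebooks_dir = os.path.join(tmp, "project", "jupyter_notebooks")
    os.makedirs(notebooks_dir)
    os.chdir(notebooks_dir)

    current_dir = os.getcwd()
    # Move up only while we are still inside the notebooks subfolder
    if os.path.basename(current_dir) == "jupyter_notebooks":
        os.chdir(os.path.dirname(current_dir))

    new_dir_name = os.path.basename(os.getcwd())
    os.chdir(start_dir)  # leave the temporary folder before it is deleted

print(new_dir_name)  # project
```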

## Loading Dataset

In [6]:
import pandas as pd

df = pd.read_csv("inputs/datasets/cleaning/garages_and_build_years.csv")
df.head()

## Exploring Data

First we check the data type of the KitchenQual feature

In [7]:
print("Kitchen feature data type is:", df['KitchenQual'].dtypes)

Let's check if there is any missing data

In [8]:
df['KitchenQual'].isnull().sum()


OK, we can see that there is no missing data.
This tells us that no values need to be imputed for this feature.
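
A null count of zero only rules out missing values; it does not show whether every label is a valid category. The sketch below runs both checks on toy data. The set of expected quality codes (`Ex`, `Gd`, `TA`, `Fa`, `Po`) is an assumption and should be checked against the dataset's own documentation; the toy values are illustrative only.

```python
import pandas as pd

# Toy data standing in for the real column (values are illustrative only)
toy = pd.DataFrame({"KitchenQual": ["Gd", "TA", "Ex", "TA", "Fa", None, "gd"]})

# Assumed set of valid quality codes; adjust to the dataset's documentation
expected_levels = {"Ex", "Gd", "TA", "Fa", "Po"}

# Number of missing entries in the column
missing_count = int(toy["KitchenQual"].isnull().sum())

# Labels that are present but not in the expected set
observed = set(toy["KitchenQual"].dropna().unique())
unexpected = sorted(observed - expected_levels)

print("Missing values:", missing_count)   # Missing values: 1
print("Unexpected labels:", unexpected)   # Unexpected labels: ['gd']
```

On the real dataframe the same two checks would use `df['KitchenQual']` in place of the toy column.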

Now we only need to encode it as integers and save the dataset as inputs/datasets/cleaning/kitchen.csv

In [9]:
import joblib
from sklearn.preprocessing import LabelEncoder

# Create an instance of LabelEncoder
label_encoder = LabelEncoder()

# Fit the encoder and replace the column with its integer codes
df['KitchenQual'] = label_encoder.fit_transform(df['KitchenQual'])

# Make sure the target folder exists, then save the fitted encoder for later reuse
os.makedirs('models/joblib', exist_ok=True)
joblib.dump(label_encoder, 'models/joblib/kitchen_qual.joblib')

# Show the label-to-integer mapping
mapping = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))
print("Current encoding: ", mapping)
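
`LabelEncoder` assigns integers by the sorted order of the labels, not by kitchen quality, so the printed mapping is worth reading before the codes are interpreted. The sketch below shows this on toy labels and reloads a saved encoder to decode the integers again. The toy values and the file name `kitchen_qual_demo.joblib` are hypothetical and used only for illustration.

```python
import os
import tempfile

import joblib
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the real column (illustrative only)
labels = ["Gd", "TA", "Ex", "TA", "Fa"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted, so the codes follow alphabetical order of the labels
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)  # {'Ex': 0, 'Fa': 1, 'Gd': 2, 'TA': 3}

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name used only for this sketch
    path = os.path.join(tmp, "kitchen_qual_demo.joblib")
    joblib.dump(encoder, path)
    restored = joblib.load(path)

# The restored encoder maps the integers back to the original labels
decoded = list(restored.inverse_transform(encoded))
print(decoded)
```

If the quality order matters for a later model, an explicit label-to-integer mapping may be a better fit than alphabetical codes; that would be a different technique from the one used in this notebook.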

## Saving current dataset

We save the current dataset as inputs/datasets/cleaning/kitchen.csv

In [10]:
df.to_csv('inputs/datasets/cleaning/kitchen.csv', index=False)
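
The `index=False` argument keeps the row index out of the file. A minimal sketch on toy data, written to in-memory buffers instead of the project path, shows what a reload looks like with and without it:

```python
import io

import pandas as pd

# Toy frame standing in for the encoded dataset (illustrative only)
toy = pd.DataFrame({"KitchenQual": [2, 3, 0]})

# Write once with the default index and once without it
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)

without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)

# Reload both versions and compare the column names
cols_with = list(pd.read_csv(with_index).columns)
cols_without = list(pd.read_csv(without_index).columns)

print(cols_with)     # ['Unnamed: 0', 'KitchenQual']
print(cols_without)  # ['KitchenQual']
```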

## Next step: cleaning and fixing data in Lot Area and Frontage